Similar Documents
20 similar documents found.
1.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from examples, an approach generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are plentiful and easily collected, labeled documents are costly to produce because the labeling must be done by human annotators. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method starts a text classification task with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results showed that the proposed method achieves reasonably useful performance compared with a supervised method. If the proposed method is used, building text classification systems becomes significantly faster and less expensive.
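As a rough illustration of the bootstrapping idea described in this abstract, here is a minimal sketch that self-trains a classifier from unlabeled documents plus one seed/title word per category. A Naive Bayes learner stands in for the paper's feature-projection technique, and all function names, seed words, and thresholds are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: self-training ("bootstrapping") a text classifier from unlabeled
# documents plus seed/title words per category. Naive Bayes stands in for the
# paper's feature-projection learner; names and thresholds are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def bootstrap_text_classifier(docs, seed_words, rounds=3, add_per_round=20):
    cats = list(seed_words)                      # e.g. {"sports": ["football"], ...}
    X = TfidfVectorizer().fit_transform(docs)
    labels = np.full(len(docs), -1)              # -1 = still unlabeled
    for ci, cat in enumerate(cats):              # machine-label docs containing a seed word
        for di, doc in enumerate(docs):
            if any(w in doc.lower() for w in seed_words[cat]):
                labels[di] = ci
    clf = MultinomialNB()
    for _ in range(rounds):
        mask = labels != -1
        clf.fit(X[mask], labels[mask])
        proba = clf.predict_proba(X)
        for ci in range(len(cats)):              # pull in the most confident docs per class
            unlabeled = np.where(labels == -1)[0]
            if len(unlabeled) == 0:
                break
            col = list(clf.classes_).index(ci)
            best = unlabeled[np.argsort(-proba[unlabeled, col])[:add_per_round]]
            labels[best] = ci
    return clf, labels
```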

2.
The problem of results merging in distributed information retrieval environments has gained significant attention in recent years. Two generic approaches have been introduced in research. The first aims at estimating the relevance of the documents returned from the remote collections through ad hoc methodologies (such as weighted score merging, regression, etc.), while the other is based on downloading all the documents locally, completely or partially, in order to calculate their relevance. Both approaches have advantages and disadvantages. Download methodologies are more effective, but they impose a significant overhead on the process in terms of time and bandwidth. Approaches that rely solely on estimation, on the other hand, usually depend on document relevance scores being reported by the remote collections in order to achieve maximum performance. In addition, regression algorithms, which have proved more effective than weighted score merging algorithms, need a significant number of overlapping documents in order to function effectively, in practice requiring multiple interactions with the remote collections. The new algorithm introduced here is based on adaptively downloading a limited, selected number of documents from the remote collections and estimating the relevance of the rest through regression. It thus reconciles the two approaches, combining their strengths while minimizing their drawbacks, achieving the limited time and bandwidth overhead of the estimation approaches and the increased effectiveness of the download approaches. The proposed algorithm is tested in a variety of settings, and its performance is found to be significantly better than that of the estimation-only approaches, while approximating that of the download approaches.
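A rough sketch of the hybrid idea follows, assuming each remote collection reports scores and that a local scoring function is available for the few downloaded documents. Simple linear regression stands in for the paper's estimation machinery, and all names (`remote_lists`, `download_and_score`, `n_download`) are placeholders.

```python
# Sketch only: merge remote result lists by downloading a handful of documents per
# collection, scoring them locally, and regressing local score on remote score to
# estimate the relevance of the documents that are not downloaded.
import numpy as np
from sklearn.linear_model import LinearRegression

def merge_remote_results(remote_lists, download_and_score, n_download=5):
    """remote_lists: {collection: [(doc_id, remote_score), ...]} in remote rank order.
       download_and_score: callable(collection, doc_id) -> local relevance score."""
    merged = []
    for coll, ranked in remote_lists.items():
        sample = ranked[:n_download]                          # limited, selected download
        X = np.array([[score] for _, score in sample])
        y = np.array([download_and_score(coll, d) for d, _ in sample])
        reg = LinearRegression().fit(X, y)                    # map remote -> local scores
        for doc_id, score in ranked:
            merged.append((coll, doc_id, float(reg.predict([[score]])[0])))
    return sorted(merged, key=lambda t: -t[2])                # single merged ranking
```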

3.
Automatic text classification is the task of organizing documents into predetermined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and making use of the gigantic amounts of information that exist in unstructured textual form, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words, in which terms are cut off from their finer context, i.e., their location in a sentence or a document. Only the broader context of the document is used, with some type of term frequency information, in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents, and even classes are clearly important, since methods that capture semantics generally achieve better classification performance. Several surveys have analyzed diverse approaches to traditional text classification, and most of them cover the application of semantic term-relatedness methods to text classification to some degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic text classification versus traditional text classification. This survey explores past and recent advances in semantic text classification and organizes existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, the survey highlights the advantages of semantic text classification algorithms over traditional text classification algorithms.

4.
Text clustering is a well-known method for information retrieval, and numerous methods for classifying words, documents, or both together have been proposed. Frequently, textual data are encoded using vector models, so the corpus is transformed into a term-by-document matrix; using this representation, text clustering generates groups of similar objects on the basis of the presence or absence of words in the documents. An alternative way to work with texts is to represent them as a network in which nodes are entities connected by the presence and distribution of the words in the documents. In this work, after summarising the state of the art of text clustering, we present a new network approach to textual data. We undertake text co-clustering using methods developed for social network analysis. Several experimental results are presented to demonstrate the validity of the approach and its advantages compared to existing methods.
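As a minimal sketch of the network representation this abstract describes, the snippet below builds a bipartite document-term graph and groups documents and terms together with a community-detection method from social network analysis. Modularity-based communities stand in for the specific SNA techniques used in the paper; the tokenization and example inputs are illustrative assumptions.

```python
# Sketch only: represent a corpus as a bipartite document-term network and
# "co-cluster" documents and terms together via community detection.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cocluster(docs):
    G = nx.Graph()
    for i, doc in enumerate(docs):
        d = f"doc{i}"
        G.add_node(d, kind="document")
        for term in set(doc.lower().split()):
            G.add_node(term, kind="term")
            G.add_edge(d, term)                 # presence of the term links doc and term
    # Each community mixes documents and terms, i.e. one co-cluster.
    return [sorted(c) for c in greedy_modularity_communities(G)]

# Example:
# cocluster(["text clustering of documents", "social network analysis of networks"])
```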

5.
Associative classification methods have recently been applied to various categorization tasks due to their simplicity and high accuracy. To improve coverage of test documents and to raise classification accuracy, some associative classifiers generate a huge number of association rules during the mining step. We present two algorithms to increase the computational efficiency of associative classification: one to store rules very efficiently, and the other to increase the speed of rule matching while using all of the generated rules. Empirical results on three large-scale text collections demonstrate that the proposed algorithms increase the feasibility of applying associative classification to large-scale problems.
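To make the rule-storage and rule-matching problem concrete, here is a small hedged sketch: rules are kept compactly as (antecedent term set, label, confidence) tuples, and an inverted index from terms to rule ids ensures only rules that can possibly fire are checked. The data structures and the confidence-voting scheme are illustrative, not the paper's algorithms.

```python
# Sketch only: compact rule store for associative classification with an inverted
# index from terms to rule ids so that only candidate rules are tested per document.
from collections import defaultdict

class RuleStore:
    def __init__(self):
        self.rules = []                           # (antecedent frozenset, label, confidence)
        self.index = defaultdict(set)             # term -> ids of rules containing it

    def add(self, antecedent, label, confidence):
        rid = len(self.rules)
        self.rules.append((frozenset(antecedent), label, confidence))
        for term in antecedent:
            self.index[term].add(rid)

    def classify(self, doc_terms):
        doc = set(doc_terms)
        candidates = set().union(*(self.index[t] for t in doc if t in self.index)) if doc else set()
        votes = defaultdict(float)
        for rid in candidates:                    # only rules that share a term with the doc
            antecedent, label, conf = self.rules[rid]
            if antecedent <= doc:                 # rule fires if all its terms appear
                votes[label] += conf
        return max(votes, key=votes.get) if votes else None
```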

6.
The automated classification of texts into predefined categories has witnessed booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Owing to category imbalance and feature sparsity in social text collections, filter methods may perform poorly. In this paper, we perform feature selection within the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
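The snippet below is a loose sketch of a Metropolis-style search over a binary inclusion/exclusion vector. Cross-validated Naive Bayes accuracy is used as a stand-in objective instead of the paper's generative-model posterior, and the temperature, iteration count, and dense-matrix assumption are all illustrative.

```python
# Sketch only: Metropolis-style search over a binary inclusion/exclusion vector for
# feature selection. CV accuracy of Naive Bayes stands in for the paper's posterior.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def metropolis_feature_search(X, y, n_iter=200, temperature=0.01, seed=0):
    """X: dense non-negative document-term matrix; y: class labels."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    z = rng.integers(0, 2, size=n_features).astype(bool)      # inclusion vector
    score = lambda m: cross_val_score(MultinomialNB(), X[:, m], y, cv=3).mean() if m.any() else 0.0
    current = score(z)
    for _ in range(n_iter):
        j = rng.integers(n_features)
        proposal = z.copy()
        proposal[j] = ~proposal[j]                             # flip one inclusion bit
        new = score(proposal)
        # Accept better subsets always, worse ones with Metropolis probability.
        if new >= current or rng.random() < np.exp((new - current) / temperature):
            z, current = proposal, new
    return z, current
```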

7.
朱秀华 《现代情报》2009,29(5):163-165
To address the problem of automatic web page classification in information mining, a classification method based on the vector space model and a parallel BP (back propagation) network is proposed. The network consists of multiple sub-networks connected in parallel; each sub-network is responsible for extracting the features of one class of patterns, all sub-networks process the patterns in parallel, and the classification result is produced at a common output layer. The effectiveness of the method is verified on the classification of tourism web pages from the Internet.
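As a rough approximation of the "parallel BP network" idea, the sketch below trains one small back-propagation sub-network per class (one-vs-rest) over vector-space features and lets the final step pick the most confident sub-network. Architecture sizes and function names are assumptions for illustration only.

```python
# Sketch only: one back-propagation sub-network per class over TF-IDF vector-space
# features; the "output layer" simply selects the most confident sub-network.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

def train_parallel_bp(docs, labels, classes):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    subnets = {}
    for c in classes:                                        # one sub-network per class
        y = np.array([1 if l == c else 0 for l in labels])
        subnets[c] = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
    def predict(new_docs):
        Xn = vec.transform(new_docs)
        scores = {c: net.predict_proba(Xn)[:, 1] for c, net in subnets.items()}
        return [max(scores, key=lambda c: scores[c][i]) for i in range(Xn.shape[0])]
    return predict
```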

8.
This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories using deep learning. The architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Integrating the acquired terminology into the ontology extends the semantically rich document representation with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in the documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which aims to find the most appropriate category for each document through deep learning. Three deep learning networks, each belonging to a different category of machine learning techniques, are used for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a three-hidden-layer feedforward network with 1024 neurons obtains the highest document classification performance on the INFUSE dataset. The F1 score is further increased by almost five percentage points, to 78.10%, for the same network configuration when the relevant terminology integrated into the ontology is applied to enrich the document representation. Furthermore, we conducted a comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.

9.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise owing to the difficulty of relevance judgment, among other reasons. To tackle this problem, we propose a novel approach to learning to rank based on a probabilistic graphical model. Considering that the observed label may be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true labels and the observed labels given the document features. The graphical model distinguishes true labels from observed labels and is specially designed for ranking in information retrieval, so it helps to learn a more accurate model from noisy training data. Experiments on a real web search dataset show that the proposed approach significantly outperforms previous approaches.

10.
[Purpose/Significance] To explore whether the number of citations a paper receives is related to its content, i.e., the way its concepts are combined. [Method/Process] The immunology field in the WoS (Web of Science) database was selected; subject terms were extracted from three sets of papers with high, medium, and low citation frequencies, and the concentration and dispersion of the subject-term frequency distributions in each set were analyzed. Subject-term co-occurrence networks were then constructed for each set, and their topological properties were analyzed to understand the similarities and differences in concept combination across the three sets and to measure the relationship between atypical combinations and novelty. [Results/Conclusions] (1) Paper sets with different citation frequencies differ considerably in the distribution of topic types and in the dispersion of subject terms. (2) The subject-term co-occurrence networks of the highly cited and medium-cited paper sets exhibit the small-world property, while the network of the low-cited set does not. (3) The co-occurrence network of the highly cited set is relatively dense, and its proportion of atypical subject-term combinations is higher than in the other two sets, whereas the network of the low-cited set is relatively loose. A paper's citation count is related to the popularity of its topics, the closeness of the connections between its topics, and the way its topics are combined.

11.
Text classification is an important research topic in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied to this task. However, in existing graph-based models, text graphs constructed by rules are not real graph data and introduce massive noise. More importantly, with a fixed corpus-level graph structure, these models cannot sufficiently exploit the labeled and unlabeled information of nodes. Meanwhile, contrastive learning has been developed as an effective method in the graph domain to fully utilize node information. We therefore propose a new graph-based model for text classification, named CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. First, we explore word co-occurrence and document-word relationships to construct a text graph. Then, we design an adaptive augmentation strategy for the noisy text graph to generate two contrastive views that effectively address the noise problem while preserving essential structure. Specifically, we design noise-based and centrality-based augmentation strategies on the topological structure of the text graph to perturb unimportant connections and thus highlight the relatively important edges. For labeled nodes, we take nodes with the same label as multiple positive samples and assign them to the anchor node, while we employ consistency training on unlabeled nodes to constrain model predictions. Finally, to reduce the resource consumption of contrastive learning, we adopt a random sampling method that selects only some nodes for calculating the contrastive loss. Experimental results on several benchmark datasets demonstrate the effectiveness of CGA2TC on the text classification task.
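As a minimal illustration of the sampled contrastive objective mentioned at the end of this abstract, the sketch below computes an NT-Xent-style loss between node embeddings from two augmented views, on a random sample of nodes. The GNN encoder, augmentation strategies, and consistency training are not shown; the embedding shapes and temperature are assumptions.

```python
# Sketch only: NT-Xent-style contrastive loss between node embeddings from two
# augmented graph views, computed on a random node sample to reduce cost.
import numpy as np

def sampled_contrastive_loss(z1, z2, sample_size=64, temperature=0.5, seed=0):
    """z1, z2: (n_nodes, dim) embeddings of the same nodes under two views."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(z1), size=min(sample_size, len(z1)), replace=False)
    a = z1[idx] / np.linalg.norm(z1[idx], axis=1, keepdims=True)
    b = z2[idx] / np.linalg.norm(z2[idx], axis=1, keepdims=True)
    sim = a @ b.T / temperature                    # cosine similarities across views
    # For node i, the positive pair is (view1_i, view2_i); all others are negatives.
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```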

12.
In this paper we present a new algorithm for relevance feedback (RF) in information retrieval. Unlike conventional RF algorithms, which use the top-ranked documents for feedback, the proposed algorithm is an active feedback algorithm that chooses documents for the user to judge. The objectives are (a) to increase the number of judged relevant documents and (b) to increase the diversity of judged documents during the RF process. The algorithm uses document contexts by splitting the retrieval list into sub-lists according to the query-term patterns that exist in the top-ranked documents. Query-term patterns include a single query term, a pair of query terms that occur in a phrase, and query terms that occur in proximity. The algorithm is iterative, taking one document for feedback in each iteration. We experiment with the algorithm using the TREC-6, -7, -8, -2005 and GOV2 data collections and simulate user feedback using the TREC relevance judgements. The experimental results show that the proposed split-list algorithm is better than the conventional RF algorithm and more reliable than a similar algorithm using maximal marginal relevance.
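A simplified sketch of the split-list idea follows: the retrieval list is grouped into sub-lists by which query terms each document contains (a crude stand-in for the paper's phrase and proximity patterns), and feedback documents are drawn from the sub-lists in rotation to diversify the judged set. All names and the pattern definition are illustrative.

```python
# Sketch only: split the ranked list into sub-lists by query-term pattern and pick
# feedback documents from sub-lists in rotation to diversify judged documents.
from itertools import cycle

def split_list_feedback(ranked_docs, query_terms, n_feedback=5):
    """ranked_docs: [(doc_id, text), ...] in rank order."""
    sublists = {}
    for doc_id, text in ranked_docs:
        present = tuple(sorted(t for t in query_terms if t in text.lower()))
        sublists.setdefault(present, []).append(doc_id)       # group by term pattern
    chosen, pools = [], cycle(list(sublists.values()))
    while len(chosen) < n_feedback and any(sublists.values()):
        pool = next(pools)
        if pool:
            chosen.append(pool.pop(0))                        # top-ranked doc of this sub-list
    return chosen
```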

13.
Digital documents such as books and journal papers often provide few textual features, which makes their feature vectors semantically inaccurate and degrades classification performance. To address this, this paper proposes a digital-document classification method based on semantic feature expansion. The method first uses TF-IDF to obtain core feature terms with high TF-IDF values that strongly represent the document text; it then expands the core term set into semantic concepts with the aid of the HowNet semantic dictionary and the open knowledge base Wikipedia, building a low-dimensional, semantically rich concept vector space; finally, classifiers are constructed with algorithms such as MaxEnt and SVM to classify the digital documents automatically. Experimental results show that, compared with traditional short-text classification methods based on feature selection, this method effectively expands the semantics of short-text features and improves the classification performance for digital documents.
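The sketch below illustrates the pipeline in miniature: select high-TF-IDF core terms, expand them through an external concept resource, and train an SVM on the enriched text. A plain Python dictionary stands in for the HowNet/Wikipedia lookups, and all names and the top-k cutoff are assumptions.

```python
# Sketch only: high-TF-IDF core terms expanded via a concept dictionary (stand-in
# for HowNet / Wikipedia), then an SVM trained on the semantically enriched text.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def classify_with_expansion(docs, labels, concept_dict, top_k=10):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    terms = np.array(vec.get_feature_names_out())
    expanded = []
    for i, doc in enumerate(docs):
        row = X[i].toarray().ravel()
        core = terms[np.argsort(-row)[:top_k]]                  # high-TF-IDF core terms
        extra = [c for t in core for c in concept_dict.get(t, [])]
        expanded.append(doc + " " + " ".join(extra))             # semantically enriched text
    vec2 = TfidfVectorizer()
    clf = LinearSVC().fit(vec2.fit_transform(expanded), labels)
    return vec2, clf
```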

14.
郑凤萍 《现代情报》2007,27(3):143-144
This paper proposes a classification method based on a fuzzy vector space model and a radial basis function (RBF) network. During feature extraction, the method fully takes into account the position of each feature term within the document and constructs fuzzy feature vectors, bringing automatic classification closer to manual classification. The effectiveness of the method is verified on document data from the China Journal Net (CNKI) full-text database.
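As a loose sketch of the two ingredients, the code below weights terms more heavily when they appear earlier in a document (a crude stand-in for the paper's position-based fuzzy weighting) and trains a small RBF network built from k-means centres with a least-squares output layer. Weighting scheme, centre count, and gamma are illustrative assumptions.

```python
# Sketch only: position-weighted term features plus a small RBF network
# (k-means centres, Gaussian activations, least-squares output weights).
import numpy as np
from sklearn.cluster import KMeans

def position_weighted_vectors(docs, vocab):
    V = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for di, doc in enumerate(docs):
        words = doc.lower().split()
        for pos, w in enumerate(words):
            if w in V:                                   # earlier position -> larger weight
                X[di, V[w]] += 1.0 - pos / max(len(words), 1)
    return X

def train_rbf_network(X, y, n_centres=10, gamma=1.0):
    """y: integer class labels 0..K-1."""
    centres = KMeans(n_clusters=n_centres, n_init=10).fit(X).cluster_centers_
    def hidden(Z):                                       # Gaussian RBF activations
        d = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    W, *_ = np.linalg.lstsq(hidden(X), np.eye(len(set(y)))[np.asarray(y)], rcond=None)
    return lambda Z: hidden(Z) @ W                       # class scores; argmax to classify
```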

15.
This study proposes a novel extended co-citation search technique: graph-based document retrieval on a co-citation network that contains citation context information. The proposed search expands the scope of the target documents by repeatedly spreading the co-citation relationship in order to obtain relevant documents that are not identified by traditional co-citation searches. Specifically, the technique combines (a) a graph-based algorithm that computes similarity scores on a complicated network, and (b) the incorporation of co-citation contexts into the calculation of similarity scores to reduce the negative effects of an increasing number of irrelevant documents. To evaluate search performance, 10 proposed methods (five representative graph-based algorithms applied to co-citation networks weighted with or without contexts) are compared with two baselines (a traditional co-citation search with or without contexts) in information retrieval experiments on two test collections (biomedical and computational linguistics articles). The results showed that the normalized discounted cumulative gain (nDCG) scores of the proposed methods using co-citation contexts tended to be higher than those of the baselines. In addition, the combination of the random walk with restart (RWR) algorithm and the context-weighted network achieved the best search performance among the 10 proposed methods. Thus, the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and sole use of a graph-based algorithm is not enough to improve on the baselines.
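For reference, the random walk with restart (RWR) algorithm mentioned above can be sketched as a simple power iteration over a weighted co-citation adjacency matrix; the adjacency weights (plain or context-weighted co-citation counts), restart probability, and tolerances here are illustrative assumptions.

```python
# Sketch only: random walk with restart (RWR) on a weighted co-citation network,
# scoring every document's relatedness to a seed document by power iteration.
import numpy as np

def random_walk_with_restart(adj, seed_index, restart=0.15, tol=1e-8, max_iter=1000):
    """adj: (n, n) symmetric non-negative co-citation weight matrix."""
    n = adj.shape[0]
    P = adj / np.maximum(adj.sum(axis=0, keepdims=True), 1e-12)   # column-normalised
    e = np.zeros(n); e[seed_index] = 1.0                          # restart vector
    r = e.copy()
    for _ in range(max_iter):
        r_new = (1 - restart) * P @ r + restart * e
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r                                                      # similarity scores to the seed
```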

16.
Semi-supervised anomaly detection methods leverage a few anomaly examples to yield drastically improved performance compared to unsupervised models. However, they still suffer from two limitations: (1) unlabeled anomalies (i.e., anomaly contamination) may mislead the learning process when all the unlabeled data are employed as inliers for model training; (2) only discrete supervision information (such as binary or ordinal data labels) is exploited, which leads to suboptimal learning of anomaly scores that essentially take on a continuous distribution. This paper therefore proposes a novel semi-supervised anomaly detection method that devises contamination-resilient continuous supervisory signals. Specifically, we propose a mass interpolation method to diffuse the abnormality of labeled anomalies, thereby creating new data samples labeled with continuous abnormal degrees. Meanwhile, the contaminated area can be covered by new data samples generated via combinations of data with correct labels. A feature-learning-based objective is added as an optimization constraint to regularize the network and further enhance robustness with respect to anomaly contamination. Extensive experiments on 11 real-world datasets show that our approach significantly outperforms state-of-the-art competitors by 20%–30% in AUC-PR and obtains more robust and superior performance in settings with different anomaly contamination levels and varying numbers of labeled anomalies.
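The interpolation idea can be illustrated with a very small sketch: new samples are generated between labeled anomalies and unlabeled data, and the mixing weight itself serves as a continuous abnormal degree in [0, 1]. This is only a schematic of the general idea under those assumptions, not the paper's exact mass-interpolation procedure.

```python
# Sketch only: create new samples with continuous abnormal degrees by interpolating
# labeled anomalies with unlabeled data; the mixing weight is the supervisory signal.
import numpy as np

def interpolate_anomaly_degrees(x_anomalies, x_unlabeled, n_new=1000, seed=0):
    rng = np.random.default_rng(seed)
    a = x_anomalies[rng.integers(len(x_anomalies), size=n_new)]
    u = x_unlabeled[rng.integers(len(x_unlabeled), size=n_new)]
    lam = rng.random((n_new, 1))                 # interpolation weight = abnormal degree
    x_new = lam * a + (1 - lam) * u              # new samples between anomaly and inlier
    y_new = lam.ravel()                          # continuous supervisory signal in [0, 1]
    return x_new, y_new
```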

17.
We propose a CNN-BiLSTM-Attention classifier to classify online short messages in Chinese posted by users on government web portals, so that a message can be directed to one or more government offices. Our model leverages all available information to carry out multi-label classification, making use of different hierarchical text features and the label information. In particular, our method extracts label meaning, the CNN layer extracts local semantic features of the texts, the BiLSTM layer fuses the contextual features of the texts with the local semantic features, and the attention layer selects the most relevant features for each label. We evaluate our model on two large public corpora and on our high-quality handcrafted e-government multi-label dataset, which was constructed with the text annotation tool doccano and consists of 29,920 data points. Experimental results show that our proposed method is effective under common multi-label evaluation metrics, achieving micro-F1 of 77.22%, 84.42%, and 87.52%, and macro-F1 of 77.68%, 73.37%, and 83.57% on these three datasets, respectively, confirming that our classifier is robust. We conduct an ablation study to evaluate our label embedding method and attention mechanism. Moreover, a case study on our handcrafted e-government multi-label dataset verifies that our model integrates all types of semantic information in short messages based on different labels to achieve text classification.
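To make the described architecture concrete, here is a compact PyTorch sketch of a CNN → BiLSTM → attention → multi-label output pipeline. All dimensions are illustrative, the paper's label-embedding component is omitted, and this should be read as a generic sketch of the architecture family rather than the authors' model.

```python
# Sketch only: CNN -> BiLSTM -> attention -> multi-label logits, trained with
# BCEWithLogitsLoss. Dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab_size, n_labels, emb=128, conv=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.conv = nn.Conv1d(emb, conv, kernel_size=3, padding=1)   # local semantic features
        self.lstm = nn.LSTM(conv, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)                         # relevance of each position
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(x)                       # (batch, seq, 2*hidden) contextual features
        weights = torch.softmax(self.attn(h), dim=1)
        pooled = (weights * h).sum(dim=1)         # attention-weighted message vector
        return self.out(pooled)                   # multi-label logits

# Example usage:
# model = CnnBiLstmAttention(vocab_size=30000, n_labels=10)
# loss = nn.BCEWithLogitsLoss()(model(torch.randint(1, 30000, (4, 50))), torch.zeros(4, 10))
```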

18.
19.
20.
This paper proposes new modified methods for back propagation neural networks and uses a semantic feature space to improve categorization performance and efficiency. The standard back propagation neural network (BPNN) has the drawbacks of slow learning and getting trapped in local minima, leading to a network with poor performance and efficiency. In this paper, we propose two methods to modify the standard BPNN and adopt the semantic feature space (SFS) method to reduce the number of dimensions as well as construct latent semantics between terms. The experimental results show that the modified methods enhanced the performance of the standard BPNN and were more efficient than it. The SFS method not only greatly reduces dimensionality but also enhances performance, and can therefore be used to further improve the precision and efficiency of text categorization systems.
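As a small illustration of pairing a semantic feature space with a back-propagation network, the sketch below uses truncated SVD (latent semantic analysis) to reduce dimensionality and capture latent term relations before a small MLP. Component counts and layer sizes are assumptions, and this does not reproduce the paper's specific BPNN modifications.

```python
# Sketch only: a semantic feature space via truncated SVD (LSA) feeding a small
# back-propagation network; sizes are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def sfs_bpnn(docs, labels, n_components=100):
    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=n_components),          # semantic feature space
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
    )
    return model.fit(docs, labels)
```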
