Similar Literature
20 similar documents retrieved (search time: 62 ms).
1.
Imbalanced sample distribution is usually the main reason for the performance degradation of machine learning algorithms. To address this, this study proposes a hybrid framework (RGAN-EL) that combines generative adversarial networks with an ensemble learning method to improve classification performance on imbalanced data. First, we propose a training-sample selection strategy based on roulette wheel selection so that the GAN pays more attention to the class-overlap area when fitting the sample distribution. Second, we design two kinds of generator training loss and propose a noise-sample filtering method to improve the quality of the generated samples. Then, minority-class samples are oversampled using the improved RGAN to obtain a balanced training set. Finally, combined with the ensemble learning strategy, the final training and prediction are carried out. We conducted experiments on 41 real imbalanced datasets using two evaluation metrics, F1-score and AUC. Specifically, RGAN-EL is compared with six typical ensemble learning methods, and RGAN with three typical GAN models. The experimental results show that RGAN-EL significantly outperforms the six ensemble learning methods, and RGAN improves considerably on the three classical GAN models.
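A minimal sketch of the roulette-wheel selection step described above, assuming each candidate sample already carries a precomputed score reflecting how close it lies to the class-overlap region; `roulette_wheel_select` and `overlap_scores` are illustrative names, not the paper's:

```python
import numpy as np

def roulette_wheel_select(scores, n_picks, rng=None):
    """Select n_picks sample indices with probability proportional to score.

    Samples with higher scores (e.g., those near the class-overlap region)
    are more likely to be fed to the GAN during training.
    """
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()        # normalize scores into a wheel
    return rng.choice(len(scores), size=n_picks, replace=True, p=probs)

# Toy usage: five minority samples; the middle ones sit in the overlap area.
overlap_scores = [0.1, 0.8, 0.9, 0.7, 0.2]
picked = roulette_wheel_select(overlap_scores, n_picks=32, rng=0)
```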

2.
Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be seen as a single-label multi-class text categorization problem. Very often there are extremely few training texts for at least some of the candidate authors, or there is significant variation in text length among the available training texts of the candidate authors. Moreover, in this task there is usually no similarity between the distribution of training and test texts over the classes; that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model: minority classes can be segmented into many short samples and majority classes into fewer, longer samples. We explore text sampling methods for constructing a training set with a desirable distribution over the classes. Essentially, text sampling provides new synthetic data that artificially increase the training size of a class. Based on two text corpora in two languages, namely newswire stories in English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class imbalanced cases that reveal the properties of the presented methods.
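A toy sketch of the central segmentation idea, assuming whitespace-tokenized author texts; the fixed per-class sample count and chunking policy here are a simplification of the sampling methods explored in the paper:

```python
def segment_by_class_size(texts_by_author, target_samples):
    """Split each author's concatenated training text into roughly
    target_samples chunks, so minority authors (little text) yield many
    short samples and majority authors fewer, longer ones, evening out
    the class distribution."""
    samples = {}
    for author, texts in texts_by_author.items():
        words = " ".join(texts).split()
        size = max(1, len(words) // target_samples)  # chunk length scales with class size
        samples[author] = [" ".join(words[i:i + size])
                           for i in range(0, len(words), size)][:target_samples]
    return samples
```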

3.
In text classification, feature selection is necessary to alleviate the curse of dimensionality caused by high-dimensional text data. In this paper, we utilize class term frequency (CTF) and class document frequency (CDF) to characterize the relevance between terms and categories at the level of term frequency (TF) and document frequency (DF). On the basis of this relevance measurement, three feature selection methods (ADF based on CTF (ADF-CTF), ADF based on CDF (ADF-CDF), and ADF based on both CTF and CDF (ADF-CTDF)) are proposed to identify relevant and discriminative terms by introducing absolute deviation factors (ADFs). Absolute deviation, a concept from statistics, is adopted here for the first time to measure the relevance divergence characterized by CTF and CDF. In addition, ADF-CTF and ADF-CDF can be combined with existing DF-based and TF-based methods, respectively, yielding new ADF-based methods. Experimental results on six high-dimensional textual datasets using three classifiers indicate that ADF-based methods outperform the original DF-based and TF-based ones in 89% of cases in terms of Micro-F1 and Macro-F1, which demonstrates how integrating ADFs into existing methods boosts classification performance. The findings also show that ADF-CTDF ranks first on average across the datasets and significantly outperforms the other methods in 99% of cases.
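A sketch of how an absolute-deviation factor over class term frequencies might be computed; the exact ADF definition in the paper may differ, so treat the formula below as illustrative:

```python
import numpy as np

def adf_ctf_scores(X, y):
    """Score terms by the absolute deviation of their class term frequency.

    X: (n_docs, n_terms) raw term-count matrix; y: class label per doc.
    A term whose frequency diverges strongly across classes gets a high
    score, i.e., it is assumed to be more discriminative.
    """
    classes = np.unique(y)
    # CTF: term frequency aggregated per class, normalized within each class
    ctf = np.vstack([X[y == c].sum(axis=0) for c in classes]).astype(float)
    ctf /= ctf.sum(axis=1, keepdims=True) + 1e-12
    # ADF: mean absolute deviation of CTF from its cross-class mean
    return np.abs(ctf - ctf.mean(axis=0)).mean(axis=0)
```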

4.
Many problems in data mining involve datasets with multiple views, where the feature space consists of multiple feature groups. Previous studies employed view-weighting methods to find a shared cluster structure underneath the different views. However, most of these studies applied gradient optimization to update the cluster centroids and feature weights iteratively, which leaves the final partition only locally optimal. In this work, we propose a novel bi-level weighted multi-view clustering method that emphasizes fuzzy weighting at both the view and the feature level. Furthermore, an efficient global search strategy combining particle swarm optimization with gradient optimization is proposed to solve the induced non-convex loss function. In the experimental analysis, the performance of the proposed method is compared with five state-of-the-art weighted clustering algorithms on three real-world high-dimensional multi-view datasets.
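A sketch of the bi-level fuzzy weighting idea as a weighted distance, where weights enter at both the view and the feature level; the paper's actual objective and update rules are not reproduced here, and the exponent `m` is the usual fuzzifier assumption:

```python
import numpy as np

def bilevel_weighted_distance(x, c, view_slices, view_w, feat_w, m=2.0):
    """Distance from sample x to centroid c under bi-level fuzzy weighting:
    each view carries a fuzzy weight and, within each view, every feature
    carries its own; m controls the fuzziness of the weights."""
    d = 0.0
    for v, sl in enumerate(view_slices):
        dv = (feat_w[sl] ** m) * (x[sl] - c[sl]) ** 2  # feature-level weighting
        d += (view_w[v] ** m) * dv.sum()               # view-level weighting
    return d

# Toy usage: two views, the first covering features 0-2, the second 3-5.
x, c = np.random.rand(6), np.random.rand(6)
view_slices = [slice(0, 3), slice(3, 6)]
view_w = np.array([0.6, 0.4])         # fuzzy view weights (sum to 1)
feat_w = np.full(6, 1.0 / 3.0)        # fuzzy feature weights within views
dist = bilevel_weighted_distance(x, c, view_slices, view_w, feat_w)
```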

5.
In recent years, the functionality of web services has typically been described in short natural-language text, and keyword-based searching is not efficient for providing relevant results in web service discovery. When services are clustered according to similarity, the search space is reduced, and with it the search time of the discovery process. In the domain of web service clustering, topic modeling techniques such as Latent Dirichlet Allocation (LDA), the Correlated Topic Model (CTM), and the Hierarchical Dirichlet Process (HDP) are usually adopted for dimensionality reduction and for representing services as vectors. But because services are described as short text, these techniques are inefficient owing to word sparsity and limited content. In this paper, the performance of web service clustering is evaluated by applying various topic modeling techniques with different clustering algorithms on a dataset crawled from the ProgrammableWeb repository. The Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture (GSDMM) model is proposed for dimensionality reduction and feature representation of services, to overcome the limitations of short-text clustering. Results show that GSDMM with K-Means or Agglomerative clustering outperforms all other methods. Clustering performance is evaluated with three extrinsic and two intrinsic evaluation criteria. The dimensionality reduction achieved by GSDMM is 90.88%, 88.84%, and 93.13% on the three crawled datasets, which is satisfactory given that clustering performance is also enhanced by deploying this technique.
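A compact collapsed Gibbs sampler for the Dirichlet Multinomial Mixture in the spirit of GSDMM (Yin & Wang, 2014); the word-fit term below ignores within-document word repeats for brevity, a reasonable approximation for short service descriptions:

```python
import numpy as np

def gsdmm(docs, vocab_size, K=20, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Collapsed Gibbs sampling for the Dirichlet Multinomial Mixture:
    each short document belongs to exactly one cluster.
    docs: list of lists of word ids in [0, vocab_size)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(docs))      # cluster assignment per doc
    m = np.zeros(K)                          # docs per cluster
    n = np.zeros(K)                          # words per cluster
    nw = np.zeros((K, vocab_size))           # per-cluster word counts
    for d, doc in enumerate(docs):
        m[z[d]] += 1; n[z[d]] += len(doc)
        for w in doc:
            nw[z[d], w] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            k = z[d]                          # remove doc d from its cluster
            m[k] -= 1; n[k] -= len(doc)
            for w in doc:
                nw[k, w] -= 1
            # log p(z_d = k) up to a constant: cluster popularity x word fit
            log_p = np.log(m + alpha)
            for k2 in range(K):
                for w in doc:
                    log_p[k2] += np.log(nw[k2, w] + beta)
                for j in range(len(doc)):
                    log_p[k2] -= np.log(n[k2] + vocab_size * beta + j)
            p = np.exp(log_p - log_p.max())
            z[d] = k = rng.choice(K, p=p / p.sum())
            m[k] += 1; n[k] += len(doc)       # add doc d to its new cluster
            for w in doc:
                nw[k, w] += 1
    return z
```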

6.
The knowledge contained in academic literature is worth mining. Inspired by the idea of molecular-marker tracing in biochemistry, three kinds of named entities, namely methods, datasets, and metrics, are extracted and used as artificial intelligence (AI) markers for AI literature. These entities can be used to trace the research process described in the bodies of papers, which opens up new perspectives for seeking and mining more valuable academic information. First, a named entity recognition model is used to extract AI markers from large-scale AI literature, and a multi-stage self-paced learning strategy (MSPL) is proposed to address the negative influence of hard and noisy samples on model training. Second, original papers are traced for AI markers, and statistical and propagation analyses are performed on the tracing results. Finally, the co-occurrences of AI markers are used for clustering, and the evolution within method clusters is explored. This AI-marker-based mining yields many significant findings; for example, the propagation rate of datasets gradually increases, and the methods proposed by China in recent years have an increasing influence on other countries.
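A generic sketch of the multi-stage self-paced idea (train on the easiest samples first, admit harder ones stage by stage); `per_sample_loss` is a hypothetical helper, and the paper's MSPL is more elaborate than this loop:

```python
import numpy as np

def per_sample_loss(model, X, y):
    """Hypothetical helper: negative log-probability of the true label,
    assuming a scikit-learn-style classifier with predict_proba."""
    proba = model.predict_proba(X)
    return -np.log(proba[np.arange(len(y)), y] + 1e-12)

def self_paced_stages(model, X, y, stages=4, start_frac=0.4):
    """Multi-stage self-paced training: refit on the easiest fraction of
    samples and grow that fraction each stage, limiting the influence of
    hard and noisy samples early on."""
    model.fit(X, y)                                # warm-up fit on all data
    for s in range(stages):
        frac = start_frac + (1.0 - start_frac) * s / max(1, stages - 1)
        losses = per_sample_loss(model, X, y)      # difficulty of each sample
        idx = np.argsort(losses)[:int(frac * len(y))]
        model.fit(X[idx], y[idx])                  # refit on the easy subset
    return model
```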

7.
Learning from imbalanced datasets is difficult. The insufficient information associated with the minority class impedes a clear understanding of the inherent structure of the dataset. Most existing classification methods tend not to perform well on minority-class examples when the dataset is extremely imbalanced, because they aim to optimize overall accuracy without considering the relative distribution of each class. In this paper, we study the performance of SVMs, which have achieved great success in many real applications, in the imbalanced data context. Through empirical analysis, we show that SVMs may suffer from biased decision boundaries and that their prediction performance drops dramatically when the data is highly skewed. We propose to combine an integrated sampling technique, which incorporates both over-sampling and under-sampling, with an ensemble of SVMs to improve prediction performance. Extensive experiments show that our method outperforms individual SVMs as well as several other state-of-the-art classifiers.
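A sketch of the integrated-sampling-plus-ensemble idea using scikit-learn and imbalanced-learn; SMOTE and random under-sampling stand in for the paper's specific sampling scheme, and binary integer labels are assumed for the majority vote:

```python
import numpy as np
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def svm_ensemble(X, y, n_members=5, seed=0):
    """Train an ensemble of SVMs, each on a differently rebalanced view of
    the data: oversample the minority to half the majority size, then
    undersample the majority down to a 1:1 ratio."""
    members = []
    for i in range(n_members):
        Xo, yo = SMOTE(sampling_strategy=0.5,
                       random_state=seed + i).fit_resample(X, y)
        Xr, yr = RandomUnderSampler(sampling_strategy=1.0,
                                    random_state=seed + i).fit_resample(Xo, yo)
        members.append(SVC(kernel="rbf").fit(Xr, yr))
    return members

def vote(members, X):
    preds = np.stack([m.predict(X) for m in members])
    # majority vote per sample (labels assumed to be non-negative ints)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```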

8.
One of the important problems in text classification is the high dimensionality of the feature space. Feature selection methods reduce this dimensionality by selecting the most valuable features for classification. Apart from reducing dimensionality, feature selection can improve text classifiers' performance both in accuracy and in time, and it helps to build simpler and therefore more comprehensible models. In this study we propose new feature selection methods for textual data, called Meaning Based Feature Selection (MBFS), based on the Helmholtz principle from the Gestalt theory of human perception, a principle also used in image processing. The proposed approaches are extensively evaluated by their effect on the classification performance of two well-known classifiers on several datasets and compared with several feature selection algorithms commonly used in text mining. Our results demonstrate the value of the MBFS methods in terms of classification accuracy and execution time.
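A sketch of a Helmholtz-style meaningfulness score; the number-of-false-alarms (NFA) form below follows the formulation commonly attributed to Balinsky et al. and may differ in detail from the paper's exact MBFS scoring:

```python
from math import comb, log

def meaning_score(m, K, B):
    """Helmholtz-style 'meaningfulness' of a word that occurs m times in one
    text unit and K times in a corpus split into B units.
    NFA = C(K, m) / B^(m-1); a lower NFA means the burst of occurrences is
    more unexpected under chance, hence more meaningful."""
    nfa = comb(K, m) / (B ** (m - 1))
    return -log(nfa) / m if nfa > 0 else float("inf")

# Toy usage: a word appearing 5 times in one paragraph, 8 times corpus-wide,
# with the corpus split into 200 paragraphs, scores as highly meaningful.
print(meaning_score(m=5, K=8, B=200))
```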

9.
Multi-label classification (MLC) has attracted many researchers in the field of machine learning, as it has a straightforward problem statement with varied solution approaches. Multi-label classifiers predict multiple labels for a single instance. The problem becomes challenging as the number of features increases, especially when many features and labels depend on each other; it then requires dimensionality reduction before any multi-label learning method is applied. This paper introduces a method named FS-MLC (Feature Selection for Multi-Label Classification using Clustering in feature space). It is a wrapper feature selection method that uses clustering to find similarity among features, and example-based precision and recall as the metrics for feature ranking, to improve the performance of the associated classifier in terms of sample-based measures. First, clusters are created by treating features as instances; then one feature from each cluster is selected as the representative of all features in that cluster. This reduces the number of features, as a single feature represents multiple features within a cluster. The method requires neither parameter tuning nor a user threshold for the number of features selected. Extensive experimentation is performed to evaluate the efficacy of the reduced feature sets using nine benchmark MLC datasets on twelve performance measures. The results show an impressive improvement in sample-based precision, recall, and F1-score while discarding 23%-93% of the features.
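A sketch of the cluster-then-pick-representatives step; the wrapper part that ranks features by example-based precision/recall is omitted, and note that this sketch takes a cluster count even though the paper's method avoids user-set thresholds:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_representative_features(X, n_clusters):
    """Cluster the features (columns of X) by treating each feature as an
    instance, then keep the feature closest to each cluster centroid as
    that cluster's representative, discarding the rest."""
    F = X.T                                   # features as instances
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(F)
    selected = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(F[members] - km.cluster_centers_[k], axis=1)
        selected.append(members[np.argmin(d)])  # representative feature index
    return sorted(selected)
```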

10.
Unsupervised feature selection is very attractive in many practical applications, as it needs no semantic labels during the learning process. However, the absence of semantic labels makes unsupervised feature selection more challenging, as the method can be affected by noise, redundancy, or missing values in the originally extracted features. Currently, most methods either consider the influence of noise for sparse learning or consider the internal structure of the data, but not both, leading to suboptimal results. To relieve these limitations and improve the effectiveness of unsupervised feature selection, we propose a novel method named Adaptive Dictionary and Structure Learning (ADSL) that conducts spectral learning and sparse dictionary learning in a unified framework. Specifically, we adaptively update the dictionary based on sparse dictionary learning, and we introduce a spectral learning method that adaptively updates the affinity matrix. While redundant features are removed, the intrinsic structure of the original data is retained. In addition, we adopt matrix completion in our framework so that it can handle the missing-data problem. We validate the effectiveness of our method on several public datasets. Experimental results show that our model not only outperforms some state-of-the-art methods on complete datasets but also achieves satisfying results on incomplete datasets.

11.
Text documents usually contain high-dimensional non-discriminative (irrelevant and noisy) terms, which lead to steep computational costs and poor learning performance in text classification. One effective solution to this problem is feature selection, which aims to identify discriminative terms from text data. This paper proposes a method termed Hebb rule based feature selection (HRFS). HRFS is based on the supervised Hebb rule: it regards terms and classes as neurons and selects terms under the assumption that a term is discriminative if it keeps "exciting" the corresponding classes; in other words, a term is highly correlated with a class if it is able to keep exciting that class, in line with the original Hebb postulate. Six benchmark datasets are used to compare HRFS with seven other feature selection methods. Experimental results indicate that HRFS achieves better performance than the compared methods and can identify discriminative terms from the perspective of synapses between neurons. Moreover, HRFS is efficient because it can be expressed as matrix operations, which decreases the complexity of feature selection.
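The "keeps exciting the class" intuition reduces naturally to a matrix operation; a minimal sketch, with the caveat that the paper's exact Hebbian update may differ:

```python
import numpy as np

def hebb_term_scores(X, Y):
    """Hebbian co-excitation between terms and classes: a term that keeps
    firing together with a class accumulates synaptic weight.

    X: (n_docs, n_terms) term weights; Y: (n_docs, n_classes) one-hot labels.
    Returns one score per term; select the top-ranked terms.
    """
    W = X.T @ Y            # accumulated term-class 'synapse' weights
    return W.max(axis=1)   # score a term by its strongest class link
```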

12.
With the popularity of social platforms such as Sina Weibo and Twitter, a large number of public events spread rapidly on social networks, and huge amounts of textual data are generated along with the discussion among netizens. Social text clustering has become one of the most critical methods to help people find relevant information and to provide quality data for subsequent, timely public opinion analysis. Most existing neural clustering methods rely on manual labeling of training sets and take a long time to learn. Owing to the explosiveness and large scale of social media data, satisfying users' timeliness demands is a challenge for social text clustering. This paper proposes a novel unsupervised event-oriented graph clustering framework (EGC), which achieves efficient clustering on large-scale datasets with low time overhead and does not require any labeled data. Specifically, EGC first mines the potential relations in social text data and transforms the textual data of social media into an event-oriented graph, exploiting the graph structure to represent complex relations. Second, EGC uses a keyword-based local importance method to accurately measure the weights of relations in the event-oriented graph. Finally, a bidirectional depth-first clustering algorithm based on these interrelations is proposed to cluster the nodes of the graph. By projecting the relations of the graph into a smaller domain, EGC achieves fast convergence. The experimental results show that the clustering performance of EGC on the Weibo dataset reaches 0.926 (NMI), 0.926 (AMI), and 0.866 (ARI), which is 13%-30% higher than other clustering methods. In addition, the average query time on EGC-clustered data is 16.7 ms, 90% less than on the original data.

13.
Deep hashing has been an important research topic in using deep learning to boost the performance of hash learning. Most existing deep supervised hashing methods focus on how to effectively preserve similarity in hash coding depending solely on pairwise supervision. However, such a pairwise similarity-preserving strategy cannot fully explore the semantic information in most cases, which results in information loss. To address this problem, this paper proposes a discriminative dual-stream deep hashing (DDDH) method, which integrates a pairwise similarity loss and a classification loss into a unified framework to take full advantage of label information. Specifically, the pairwise similarity loss aims to preserve the similarity and structural information of the high-dimensional original data, while the classification loss enlarges the margin between different classes, improving the discrimination of the learned binary codes. Moreover, an effective optimization algorithm is employed to train the hash-code learning framework in an end-to-end manner. Results of extensive experiments on three image datasets demonstrate that our method is superior to several state-of-the-art deep and non-deep hashing methods. Ablation studies and analysis further show the effectiveness of introducing the classification loss into the overall hash learning framework.
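A PyTorch sketch of a dual-objective hashing loss that combines a contrastive-style pairwise term with a classification term; the precise losses, network streams, and optimization in DDDH are not reproduced here:

```python
import torch
import torch.nn.functional as F

def dual_hashing_loss(codes, logits, labels, sim, margin=2.0, lam=1.0):
    """Pairwise term pulls codes of similar pairs together and pushes
    dissimilar ones apart; the classification term widens class margins.

    codes: (B, L) relaxed hash codes in [-1, 1]; logits: (B, C);
    labels: (B,) int64 class ids; sim: (B, B) 0/1 similarity matrix
    built from the labels."""
    d = torch.cdist(codes, codes)                       # pairwise distances
    pair = sim * d.pow(2) + (1 - sim) * F.relu(margin - d).pow(2)
    pair_loss = pair.mean()
    cls_loss = F.cross_entropy(logits, labels)          # discriminative term
    return pair_loss + lam * cls_loss
```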

14.
Optimized extreme learning machine (OELM) has been shown to achieve high performance on classification problems due to its simple dual form. This paper presents a predictor-corrector affine scaling interior point method to exploit the dual problem of OELM. The method combines a predictor step with a corrector step to determine the descent Newton direction. At each iteration, the predictor step focuses on reducing the complementarity gap and computes an affine scaling direction to estimate the extent of that reduction, while the corrector step traces the central path towards the optimal solution by high-order approximation and computes the corresponding centering direction. The Newton direction is then formed by combining the two directions, and the sequence of interior feasible points converges to the optimal solution. Extensive experimental evaluations on various benchmark datasets show that the proposed algorithms outperform other interior-point-based and active-set-based algorithms. Moreover, they converge in fewer iterations, independently of kernel type, dataset size, and dimensionality.

15.
Cluster analysis using multiple representations of data is known as multi-view clustering and has attracted much attention in recent years. The major drawback of existing multi-view algorithms is that their clustering performance depends heavily on hyperparameters that are difficult to set. In this paper, we propose the Multi-View Normalized Cuts (MVNC) approach, a two-step algorithm for multi-view clustering. In the first step, an initial partitioning is performed using a spectral technique. In the second step, a local search procedure refines the initial clustering. MVNC has been evaluated and compared to state-of-the-art multi-view clustering approaches on three real-world datasets. Experimental results show that MVNC significantly outperforms existing algorithms in terms of clustering quality and computational efficiency. In addition to its superior performance, MVNC is parameter-free, which makes it easy to use.

16.
This paper proposes a novel hierarchical learning strategy to deal with the data sparseness problem in semantic relation extraction by modeling the commonality among related classes. For each class in the hierarchy, whether manually predefined or automatically clustered, a discriminative function is determined in a top-down way. As an upper-level class normally has many more positive training examples than a lower-level class, its discriminative function can be determined more reliably, and it can in turn guide the learning of the lower-level discriminative function, which would otherwise suffer from limited training data. Two classifier learning approaches, the simple perceptron algorithm and state-of-the-art Support Vector Machines, are applied with the hierarchical learning strategy. Moreover, several kinds of class hierarchies, manually predefined or automatically clustered, are explored and compared. Evaluation on the ACE RDC 2003 and 2004 corpora shows that the hierarchical learning strategy substantially improves performance on low- and medium-frequency relations.
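A sketch of the top-down training scheme with linear SVMs; `tree` and `y_paths` are illustrative encodings of the class hierarchy, not the paper's data structures:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_top_down(tree, node, X, y_paths, models):
    """Train one classifier per internal node, top-down.

    tree: dict mapping a node to its child classes; y_paths[i] is the
    root-to-leaf class path of example i. Upper-level nodes see more
    positive examples, so their classifiers are estimated more reliably
    and constrain the sparser lower-level decisions."""
    children = tree.get(node, [])
    if not children:
        return models
    mask = np.array([node in p for p in y_paths])
    # target at this node: which child lies on each example's path
    yn = [next(c for c in p if c in children) for p in y_paths if node in p]
    if len(set(yn)) > 1:                    # need at least two child classes
        models[node] = LinearSVC().fit(X[mask], yn)
    for c in children:
        train_top_down(tree, c, X, y_paths, models)
    return models
```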

17.
Deep multi-view clustering (MVC) mines and employs the complex relationships among views to learn compact data clusters with deep neural networks in an unsupervised manner. Recent deep contrastive learning (CL) methods have shown promising performance in MVC by learning cluster-oriented deep feature representations, realized by contrasting positive and negative sample pairs. However, most existing deep contrastive MVC methods focus on only one side of contrastive learning, such as feature-level or cluster-level contrast, failing to integrate the two or to bring in further important aspects of contrast. Additionally, most of them work in a separate two-stage manner, first feature learning and then data clustering, so the two stages cannot mutually benefit each other. To address these challenges, in this paper we propose a novel joint contrastive triple-learning framework to learn multi-view discriminative feature representations for deep clustering; it is threefold, comprising feature-level alignment-oriented CL, feature-level commonality-oriented CL, and cluster-level consistency-oriented CL. The former two submodules contrast the encoded feature representations of data samples at different feature levels, while the last contrasts the data samples in their cluster-level representations. Benefiting from the triple contrast, more discriminative representations of the views can be obtained. Meanwhile, a view weight learning module is designed to learn and exploit the quantitative complementary information across the learned discriminative features of each view. Thus the contrastive triple-learning module, the view weight learning module, and the data clustering module over the fused features are jointly performed, so that these modules are mutually beneficial. Extensive experiments on several challenging multi-view datasets show the superiority of the proposed method over many state-of-the-art methods, including large improvements of 15.5% and 8.1% in accuracy on Caltech-4V and CCV. Owing to its promising performance on visual datasets, the proposed method can be applied to many practical visual applications such as visual recognition and analysis. The source code is provided at https://github.com/ShizheHu/Joint-Contrastive-Triple-learning.
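A sketch of one of the three contrastive terms, feature-level alignment between two views, written as a standard NT-Xent loss; the commonality- and cluster-level terms and the view-weighting module are omitted:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """Feature-level alignment contrast between two views: the embeddings
    of the same sample across views form the positive pair; all other
    samples in the batch serve as negatives.

    z1, z2: (B, d) encoded features of the same batch under two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)               # (2B, d)
    sim = z @ z.T / tau                          # cosine similarities
    sim.fill_diagonal_(-1e9)                     # exclude self-similarity
    n = z1.size(0)
    # row i < n matches row i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```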

18.
Most previous work on feature selection emphasized only the reduction of high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must resort to other means, for example more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine-learning-based text categorization that does not rely on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], a greedy feature selection method, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes lifts conventional machine learning algorithms above support vector machines, which are known to give the best classification accuracy.
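An MRMR-flavoured sketch of "keep information gain high, penalize redundancy": mutual information between (discretized, integer-valued) feature columns stands in for the paper's divergence measure here, so this is illustrative rather than the authors' exact criterion:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def ig_low_redundancy_select(X, y, n_select):
    """Greedy selection: start from the most informative feature, then at
    each step add the feature with the best trade-off of relevance to the
    class minus average redundancy with the already-selected features."""
    ig = mutual_info_classif(X, y)              # relevance of each feature
    selected = [int(np.argmax(ig))]
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            red = np.mean([mutual_info_score(X[:, j], X[:, s])
                           for s in selected])  # redundancy penalty
            if ig[j] - red > best_score:
                best, best_score = j, ig[j] - red
        selected.append(best)
    return selected
```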

19.
High-value patent identification (HVPI) and standard-essential patent identification (SEPI) are two important issues in the fields of intellectual property and standardization, respectively. Almost all existing HVPI and SEPI approaches are based on single-task learning. In this paper, we unify HVPI and SEPI in a multi-task learning framework, in consideration of the mutual reinforcement between the two tasks. In our model, we extract structured patent features and embed textual patent features using a pre-trained model. Given these features, we explore a multi-task identification model that recognizes high-value patents and standard-essential patents jointly. We evaluate our model against two state-of-the-art models on five balanced datasets and two imbalanced datasets. The results show that our multi-task learning based model significantly outperforms these single-task learning based models in precision, recall, F1, and accuracy. On the balanced datasets, the average improvements in these metrics are 1.3%, 1.29%, 1.28%, and 1.28%, respectively; on the imbalanced datasets, they are 2.24%, 1.62%, 1.75%, and 0.66%, respectively.
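A minimal PyTorch sketch of a shared-encoder, two-head multi-task setup of the kind described above; the architecture, dimensions, and equal loss weighting are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class PatentMTL(nn.Module):
    """One encoder over the concatenated structured + textual patent
    features, with two heads predicting the high-value and the
    standard-essential labels jointly."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.hvp_head = nn.Linear(hidden, 2)   # high-value patent task
        self.sep_head = nn.Linear(hidden, 2)   # standard-essential patent task

    def forward(self, x):
        h = self.shared(x)                     # shared representation
        return self.hvp_head(h), self.sep_head(h)

def mtl_loss(hvp_logits, sep_logits, y_hvp, y_sep):
    ce = nn.CrossEntropyLoss()
    # the two task losses reinforce the shared encoder
    return ce(hvp_logits, y_hvp) + ce(sep_logits, y_sep)
```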

20.
Listwise learning-to-rank models, which optimize the ranking of a document list, are among the most widely adopted algorithms for finding and ranking documents relevant to user information needs. In this paper, we propose ListMAP, a new listwise learning-to-rank model with a prior distribution that encodes the informativeness of the training data and assigns different weights to training instances. The main intuition behind ListMAP is that documents in the training dataset do not have the same impact on training a ranking function. ListMAP formalizes the listwise loss function as a maximum a posteriori estimation problem in which the scoring function must be estimated such that the log probability of the predicted ranked list is maximized given a prior distribution on the labeled data. We provide a model for approximating the prior distribution parameters from a set of observation data, and we implement the proposed learning-to-rank model using neural networks. We theoretically discuss and analyze the characteristics of the introduced model and empirically illustrate its performance on a number of benchmark datasets, namely MQ2007 and MQ2008 of the LETOR 4.0 benchmark, Set 1 and Set 2 of the Yahoo! learning-to-rank challenge dataset, and the Microsoft 30K and Microsoft 10K datasets. We show that the proposed models are effective across different datasets in terms of the information retrieval evaluation metrics NDCG and MRR at positions 1, 3, 5, 10, and 20.
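A sketch of a prior-weighted listwise loss in the spirit of ListMAP, using a ListNet-style top-1 list distribution; how the prior weight is estimated from the observation data is not shown and is assumed given:

```python
import torch
import torch.nn.functional as F

def prior_weighted_listwise_loss(scores, relevance, prior_weight):
    """Negative log-probability of the ranked list under a softmax list
    model, weighted per training list by a prior encoding how informative
    that list is (the MAP view adds this prior to the list likelihood).

    scores, relevance: (n_docs,) tensors for one query;
    prior_weight: scalar prior for this training instance."""
    target = F.softmax(relevance.float(), dim=0)   # ideal list distribution
    pred = F.log_softmax(scores, dim=0)            # model's list distribution
    nll = -(target * pred).sum()                   # ListNet-style top-1 loss
    return prior_weight * nll
```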
