Similar Documents
20 similar documents found (search time: 218 ms)
1.
周林飞  姚雪  芦晓峰 《资源科学》2016,38(8):1538-1549
BP neural networks are widely used in remote sensing image classification because of their self-learning, self-adaptive, and massively parallel processing capabilities, but their training is prone to becoming trapped in local minima and converges slowly. To address these shortcomings, this paper proposes a BP neural network classification method based on tolerance rough sets. Taking the Shuangtaizi estuary wetland as the study area and Landsat-8 OLI imagery as the data source, tolerance rough set theory is used to preprocess the sample data set, and the resulting data serve as new training samples for a BP neural network wetland cover classification model built on the Matlab platform. Wetland cover information is extracted, and the classification results are compared with those of a plain BP neural network and of rough-set attribute reduction preprocessing. The results show that the tolerance-rough-set-based BP neural network removes noisy data from the training samples, raises the training success rate, and shortens convergence time, yielding good classification: overall accuracy reaches 91.25% and the Kappa coefficient 0.8969, which are 7.92 percentage points and 0.0926 higher than the plain BP neural network, and 3.03 percentage points and 0.0357 higher than rough-set attribute reduction preprocessing. It is therefore an effective wetland cover classification method.

2.
Imbalanced sample distributions are a main cause of performance degradation in machine learning algorithms. This study therefore proposes a hybrid framework (RGAN-EL) that combines generative adversarial networks with ensemble learning to improve classification on imbalanced data. First, we propose a training sample selection strategy based on roulette wheel selection so that the GAN pays more attention to the class-overlap region when fitting the sample distribution. Second, we design two kinds of generator training loss and propose a noise-sample filtering method to improve the quality of the generated samples. The minority class is then oversampled with the improved RGAN to obtain a balanced training set. Finally, the final training and prediction are carried out with an ensemble learning strategy. We conducted experiments on 41 real imbalanced data sets using two evaluation metrics, F1-score and AUC. Specifically, RGAN-EL is compared with six typical ensemble learning methods, and RGAN with three typical GAN models. The experimental results show that RGAN-EL is significantly better than the six ensemble learning methods, and that RGAN improves substantially on the three classical GAN models.
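The roulette wheel selection this abstract builds on is the standard fitness-proportional sampling rule. A minimal sketch in Python (the function name and interface are my own; the paper's actual scoring of overlap-region samples is not reproduced here):

```python
import random

def roulette_wheel_select(fitnesses, rng=random):
    """Roulette wheel selection: return an index with probability
    proportional to its fitness value."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)
    cumulative = 0.0
    for i, f in enumerate(fitnesses):
        cumulative += f
        if cumulative >= r:
            return i
    return len(fitnesses) - 1  # guard against floating-point round-off
```

A sample with twice the fitness is drawn twice as often on average, which is how the paper's strategy would bias GAN training toward highly scored (overlap-region) samples.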

3.
Both traditional learning methods and deep learning methods have been widely applied to early Alzheimer's disease (AD) diagnosis, but they often suffer from training set bias and lack interpretability. To address these issues, this paper proposes a two-phase framework that iteratively assigns weights to samples and features. Specifically, the first phase automatically separates clean samples from the training samples. The training samples are regarded as noisy data and are assigned individual penalty weights, while the clean samples, being of high quality, are used to learn the feature weights. In the second phase, the method iteratively assigns sample weights to the training samples and feature weights to the clean samples. Because these updates are iterative, the framework addresses the training set bias issue while remaining interpretable with respect to both samples and features. Experimental results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset show that our method achieves the best classification performance on binary classification tasks and better interpretability than state-of-the-art methods.

4.
Schema matching is the problem of finding correspondences (mapping rules, e.g. logical formulae) between heterogeneous schemas, e.g. in the data exchange domain or for distributed IR in federated digital libraries. This paper introduces a probabilistic framework, called sPLMap, for automatically learning schema mapping rules from given instances of both schemas. Different techniques, mostly from the IR and machine learning fields, are combined to find suitable mapping candidates. Our approach gives a probabilistic interpretation of the candidates' prediction weights, selects the rule set with the highest matching probability, and outputs probabilistic rules capable of dealing with the intrinsic uncertainty of the mapping process. Several variants of our approach have been evaluated on several test sets.

5.
To reduce the large amount of labeled data required by conventional deep learning for classifying steel plate surface defect images, an efficient classification method based on active learning is proposed. The method comprises a lightweight convolutional neural network and an uncertainty-based strategy for selecting samples to label. The network uses a simplified convolutional base for feature extraction and replaces the hidden layer of the conventional densely connected classifier with a global pooling layer to reduce overfitting. To better measure the model's uncertainty about the class of an unlabeled image, each unlabeled sample is first passed through a model trained on the labeled samples to obtain its probability distribution over classes (PDC); the same model then predicts the labeled samples to obtain the mean PDC for each class. The KL divergence between the two distributions serves as the uncertainty score for choosing unlabeled images for manual annotation. Comparative experiments on the open NEU-CLS defect dataset show that the method achieves 97% accuracy with only 44% of the data labeled, greatly reducing annotation cost.
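The KL-divergence screening step can be sketched as follows. This is a simplified stand-in with hypothetical function names: it scores each unlabeled sample's PDC against a single reference distribution, whereas the paper compares against per-class mean PDCs:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions,
    with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_for_annotation(unlabeled_pdcs, reference_pdc, budget):
    """Rank unlabeled samples by KL divergence of their PDC from a
    reference PDC and return the indices of the `budget` most
    uncertain (most divergent) samples for manual labeling."""
    scores = [kl_divergence(p, reference_pdc) for p in unlabeled_pdcs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]
```

A sample whose predicted distribution sits far from what the model produces on labeled data is treated as most informative and sent to the annotator first.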

6.
程雅倩  黄玮  金晓祥  贾佳 《情报科学》2022,39(2):155-161
[Purpose/Significance] Multi-label texts on we-media platforms are high-dimensional and imbalanced, which degrades classification performance; studying multi-label text classification for university library we-media platforms in a 5G environment is therefore of practical importance. [Method/Process] The collected multi-label texts are first preprocessed, including removal of meaningless data, word segmentation, and stop-word removal; an improved principal component analysis then reduces the dimensionality of the multi-label texts, and a vector space model is used to balance them; finally, Adaboost and SVM classifiers are built on the processed texts to perform multi-label classification. [Result/Conclusion] Experiments show that the proposed method lowers the Hamming loss, raises the F1 score, classifies multi-label text well, and runs quickly, demonstrating its reliability. [Innovation/Limitation] Because the data set used is not large, the test and validation results are somewhat limited; future work will use richer databases to further improve and extend the method.

7.
黄静  薛书田  肖进 《软科学》2017,(7):131-134
Semi-supervised learning is combined with the multi-classifier ensemble model Bagging to build a semi-supervised ensemble model for class-imbalanced settings (SSEBI), which exploits both labeled and unlabeled samples to improve performance. The model has three stages: (1) selectively label a subset of the unlabeled data set and train several base classifiers; (2) classify the test samples with the trained base classifiers; (3) combine the classification results to obtain the final prediction. Empirical analysis on five customer credit scoring data sets demonstrates the effectiveness of the proposed SSEBI model.

8.
IT projects are a high-investment, high-risk undertaking, and risk analysis is necessary to secure high returns from IT project development. Traditional analysis methods, however, are rather subjective and struggle to assess IT project risk objectively. This paper first introduces rough set theory and applies a discernibility-matrix reduction algorithm to reduce the attributes of the IT project risk table, removing unnecessary attributes; rough set theory and AHP are then combined to determine the weights of IT project risks, and a worked example verifies the feasibility and soundness of the method.

9.
高亚琪  王昊  刘渊晨 《情报科学》2021,39(10):107-117
[Purpose/Significance] Computer-based management of image resources suffers from inadequate representation of image semantics. This paper explores how features and feature fusion affect classification results and proposes a method to improve the accuracy of semantic image classification. [Method/Process] Four image styles are defined and image description features are divided into three levels; the characteristics of feature fusion are examined to find features that express image semantics effectively. SVM, CNN, LSTM, and transfer learning methods are each used for style classification, and the algorithms are combined to improve the results. [Result/Conclusion] Deep features extracted by a transfer-learned ResNet18 model express high-level image semantics well, and combining them with an SVM raises classification accuracy. Features are not always complementary; redundant features should be avoided during feature selection, since they lower classification efficiency. [Innovation/Limitation] The number of styles defined here is small, and the style an image exhibits is not absolute: an image can often carry several labels. Future work should enrich the image data set and attempt multi-label classification.

10.
Dynamic Ensemble Selection (DES) is one of the most common and effective techniques in machine learning for classification problems. DES systems construct an ensemble of the most appropriate classifiers, selected from the candidate pool according to the competence of each individual classifier. Since several classifiers are selected, their combination becomes crucial; however, most current DES approaches focus on combining the selected classifiers while ignoring the local information surrounding the query sample to be classified. To boost the performance of DES-based classification systems, this paper proposes a dynamic weighting framework for classifier fusion when producing the final output of a DES system. The proposed method first employs a DES approach to obtain a group of classifiers for a query sample. The hypothesis vector of the selected ensemble is then obtained from a consensus analysis. Finally, a distance-based weighting scheme adjusts the hypothesis vector according to the closeness of the query sample to each class. The method is tested on 30 real-world datasets with six well-known DES approaches based on both homogeneous and heterogeneous ensembles. The results, supported by appropriate statistical tests, show that our method outperforms the original DES framework in terms of both accuracy and kappa.
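The final adjustment step can be illustrated with a toy scheme. The inverse-distance weighting below is an assumption for illustration only (the paper's exact scheme is not specified here); it shows the general idea of scaling each class score by how close the query sample is to that class:

```python
import math

def adjust_hypothesis(hypothesis, query, class_centroids, eps=1e-9):
    """Scale each class score in the hypothesis vector by the inverse
    distance of the query sample to that class centroid, then
    renormalize so the scores sum to one (hypothetical scheme)."""
    weights = [1.0 / (math.dist(query, c) + eps) for c in class_centroids]
    adjusted = [h * w for h, w in zip(hypothesis, weights)]
    total = sum(adjusted)
    return [a / total for a in adjusted]
```

A query sitting right on a class centroid pulls almost all of the mass toward that class, even when the ensemble's raw hypothesis vector was undecided.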

11.
The aim of multi-label text classification is to assign a set of labels to a given document. Previous classifier-chain and sequence-to-sequence models have a powerful ability to capture label correlations, but they rely heavily on label order, while the labels in multi-label data are essentially an unordered set; the performance of these approaches therefore varies considerably with the order in which the labels are arranged. To avoid depending on label order, we design a reasoning-based algorithm named Multi-Label Reasoner (ML-Reasoner) for multi-label classification. ML-Reasoner employs a binary classifier to predict all labels simultaneously and applies a novel iterative reasoning mechanism to exploit inter-label information, where each round of reasoning takes the previously predicted likelihoods for all labels as additional input. This utilizes information between labels while avoiding label-order sensitivity. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on the challenging AAPD dataset. We also apply our reasoning module to a variety of strong neural base models and show that it boosts performance significantly in each case.
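The iterative reasoning loop described above can be sketched as follows. The interface is hypothetical: `base_predict(features, prev_likelihoods)` stands in for the paper's neural classifier, which returns per-label likelihoods given the input plus the previous round's predictions:

```python
def iterative_reasoning(base_predict, features, n_labels, rounds=3):
    """Sketch of the ML-Reasoner loop: start from uninformative
    likelihoods, then repeatedly re-predict all labels with the
    previous round's likelihoods fed back as extra input."""
    likelihoods = [0.5] * n_labels  # uninformative starting point
    for _ in range(rounds):
        likelihoods = base_predict(features, likelihoods)
    return likelihoods
```

Because every round predicts all labels at once, no label ordering is ever imposed, which is the property the paper contrasts with classifier chains.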

12.
Most existing research on applying machine learning techniques to document summarization explores either classification models or learning-to-rank models. This paper presents our recent study on how to apply a different kind of learning model, namely regression models, to query-focused multi-document summarization. We use Support Vector Regression (SVR) to estimate the importance of a sentence in the document set to be summarized through a set of pre-defined features. To learn the regression models, we propose several methods for constructing "pseudo" training data by assigning each sentence a "nearly true" importance score calculated from the human summaries provided for the corresponding document set. A series of evaluations on the DUC data sets examines the efficiency and robustness of the proposed approaches. Compared with classification models and ranking models, regression models are consistently preferable.
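The "pseudo" training data construction can be sketched with a deliberately simple scoring rule. The word-overlap score below is an illustrative assumption, not the paper's actual score definition or feature set:

```python
def pseudo_importance(sentence, human_summary):
    """Assign a "nearly true" importance score to a sentence as the
    fraction of its words that also appear in the human summary
    (a simplified stand-in for the paper's scoring methods)."""
    s_words = set(sentence.lower().split())
    h_words = set(human_summary.lower().split())
    if not s_words:
        return 0.0
    return len(s_words & h_words) / len(s_words)

def build_pseudo_training_set(sentences, human_summary):
    """Pair each sentence with its pseudo importance score, yielding
    (sentence, score) examples a regressor such as SVR could fit."""
    return [(s, pseudo_importance(s, human_summary)) for s in sentences]
```

The point of the construction is that no human ever labels individual sentences: the reference summaries already provided for each document set induce the regression targets.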

13.
After analyzing the limitations of the Pawlak rough set model in data mining, a data mining method based on the variable precision rough set model is proposed. The variable precision rough set approach is used to perform attribute reduction and rule extraction on plywood defect data, and the resulting rules are used for classification. The results show that the variable precision rough set remedies the shortcomings of the Pawlak model and offers higher reliability and robustness.

14.
Vocabulary mining in information retrieval refers to exploiting the domain vocabulary to improve the user's query. Queries posed to information retrieval systems are most often not optimal for retrieval purposes; vocabulary mining allows one to generalize, specialize, or perform other vocabulary-based transformations on the query in order to improve retrieval performance. This paper investigates a new framework for vocabulary mining that derives from the combination of rough sets and fuzzy sets. The framework allows rough-set-based approximations even when documents and queries are described by weighted, i.e., fuzzy, representations. The paper also explores the application of generalized rough sets and the variable precision models, and examines the problem of coordination between multiple vocabulary views. Finally, a preliminary analysis is presented of issues that arise when applying the proposed vocabulary mining framework to the Unified Medical Language System (a state-of-the-art vocabulary system). The proposed framework supports the systematic study and application of different vocabulary views in information retrieval.
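The rough-set machinery this framework extends starts from the classical Pawlak lower and upper approximations of a set under an equivalence partition, which can be sketched directly (the fuzzy/weighted generalization the paper develops is not reproduced here):

```python
def rough_approximations(universe_partition, target_set):
    """Classical Pawlak approximations: the lower approximation is the
    union of partition blocks fully inside the target set; the upper
    approximation is the union of blocks that intersect it at all."""
    target = set(target_set)
    lower, upper = set(), set()
    for block in universe_partition:
        b = set(block)
        if b <= target:
            lower |= b  # block certainly belongs to the target concept
        if b & target:
            upper |= b  # block possibly belongs to the target concept
    return lower, upper
```

The gap between the two approximations (the boundary region) is what makes rough-set-based query transformations like generalization and specialization possible.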

15.
Graph Convolutional Networks (GCNs) have been established as a fundamental approach to representation learning on graphs, based on convolution operations over the non-Euclidean domain defined by graph-structured data. GCNs and their variants have achieved state-of-the-art results on classification tasks, especially in semi-supervised learning scenarios. A central challenge in semi-supervised classification is how to exploit as much as possible of the useful information encoded in the unlabeled data. In this paper, we address this issue through a novel self-training approach for improving the accuracy of GCNs on semi-supervised classification tasks. A margin score computed through a rank-based model identifies the most confident sample predictions, which are then exploited as an expanded labeled set in a second-stage training step. Our model is suitable for different GCN models. Moreover, we propose a rank aggregation of the labeled sets obtained by different GCN models. The experimental evaluation considers four GCN variations and traditional benchmarks used extensively in the literature. Significant accuracy gains were achieved for all evaluated models, reaching results comparable or superior to the state of the art. The best results were achieved by rank-aggregation self-training on combinations of the four GCN models.

16.
This paper proposes a new method that embeds a reject option in the twin support vector machine (RO-TWSVM) through the Receiver Operating Characteristic (ROC) curve for binary classification. The proposed RO-TWSVM enhances classification robustness by including an effective rejection rule for potentially misclassified samples. The method is formulated within a cost-sensitive framework that follows the principle of minimizing the expected cost of classification. Extensive experiments on synthetic and real-world data sets compare the proposed RO-TWSVM with the original TWSVM without a reject option (TWSVM-without-RO) and the existing SVM with a reject option (RO-SVM). The experimental results demonstrate that our RO-TWSVM significantly outperforms TWSVM-without-RO and, in general, performs better than RO-SVM.

17.
In practical applications, the class distribution of imbalanced data sets is skewed. Because traditional clustering methods are mainly designed to improve overall learning performance, the majority class tends to dominate the clustering while the more valuable minority class may be ignored; existing clustering methods can also perform poorly on imbalanced, high-dimensional domains. In this paper, we present one-step spectral rotation clustering for imbalanced high-dimensional data (OSRCIH), which integrates self-paced learning and spectral rotation clustering in a unified learning framework where sample selection and dimensionality reduction are considered simultaneously with mutual, iterative updates. Specifically, the imbalance problem is handled by selecting the same number of training samples from each intrinsic group of the training data, with the sample-weight vector obtained by self-paced learning; dimensionality reduction is conducted by combining subspace learning and feature selection. Experimental analysis on synthetic and real datasets shows that OSRCIH recognizes and raises the weight of important samples and features, keeping the clustering from favoring the majority class and effectively improving clustering performance.

18.
田志军  李芳芳 《科技通报》2012,28(2):134-136
An improved heuristic rough set attribute reduction algorithm based on the discernibility matrix is proposed to lower the algorithm's time and space complexity. Using real case data, with student examination scores as the object of analysis, the improved reduction algorithm is applied, corresponding evaluation indicators are designed, and the latent factors affecting student performance are analyzed, again verifying the feasibility and efficiency of the proposed algorithm.

19.
We propose a CNN-BiLSTM-Attention classifier for online short messages in Chinese posted by users on government web portals, so that each message can be directed to one or more government offices. Our model leverages all available information to carry out multi-label classification, making use of hierarchical text features as well as the label information. In particular, our method extracts label meanings, the CNN layer extracts local semantic features of the texts, the BiLSTM layer fuses the contextual features with the local semantic features, and the attention layer selects the most relevant features for each label. We evaluate our model on two large public corpora and on our high-quality handcrafted e-government multi-label dataset, which is constructed with the text annotation tool doccano and consists of 29,920 data points. Experimental results show that our proposed method is effective under common multi-label evaluation metrics, achieving micro-F1 scores of 77.22%, 84.42%, and 87.52% and macro-F1 scores of 77.68%, 73.37%, and 83.57% on these three datasets respectively, confirming that our classifier is robust. We conduct an ablation study to evaluate our label embedding method and attention mechanism. Moreover, a case study on our handcrafted e-government multi-label dataset verifies that our model integrates all types of semantic information in short messages according to the different labels to achieve text classification.

20.
Machine learning applications must continually use label information from the data stream to detect concept drift and adapt to dynamic behavior. Because label information is computationally expensive to obtain, it is impractical to assume the data stream is fully labeled, and much research on semi-supervised concept drift detection has been proposed. Despite this large research effort, there is little analysis of the information resources required to achieve a given concept drift detection accuracy. This paper therefore addresses the unexplored research question "How many labeled samples are required to detect concept drift accurately?" by proposing an analytical framework to analyze and estimate the information resources required. Specifically, the paper decomposes the distribution-based concept drift detection task into a learning task and a dissimilarity measurement task for independent analysis; the analysis results are then correlated to estimate the number of labels required within a set of data samples to detect concept drift accurately. The accuracy of this estimation is evaluated empirically, and the results suggest that it is accurate when a high amount of information resources is provided. Additionally, estimation results for a state-of-the-art method and a benchmark data set are reported to show the applicability of the estimation within benchmarked environments. In general, the estimation from the proposed analytical framework can serve as guidance when designing systems with limited information resources. We also hope this paper helps identify research gaps and inspires new research on the amount of information resources required for accurate concept drift detection.
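The distribution-based detectors the framework analyzes compare the data seen in two windows of the stream. A minimal sketch of the dissimilarity measurement side, using total-variation distance over empirical label distributions (my own choice of dissimilarity; the paper's analyzed detectors may use other measures):

```python
from collections import Counter

def drift_score(window_a, window_b):
    """Total-variation distance between the empirical label
    distributions of two stream windows; values near 0 suggest the
    same concept, larger values suggest concept drift."""
    ca, cb = Counter(window_a), Counter(window_b)
    labels = set(ca) | set(cb)
    na, nb = len(window_a), len(window_b)
    return 0.5 * sum(abs(ca[l] / na - cb[l] / nb) for l in labels)
```

The score is bounded in [0, 1], so a detector would flag drift when it exceeds a threshold; how many labeled samples each window needs for the score to be reliable is exactly the resource question the paper analyzes.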
