Similar Documents
20 similar documents found (search time: 37 ms)
1.
We propose a CNN-BiLSTM-Attention classifier to classify online short messages in Chinese posted by users on government web portals, so that each message can be directed to one or more government offices. Our model carries out multi-label classification by exploiting hierarchical text features together with label information. In particular, our method extracts label semantics, the CNN layer extracts local semantic features of the texts, the BiLSTM layer fuses the contextual features with the local semantic features, and the attention layer selects the most relevant features for each label. We evaluate our model on two large public corpora and on our high-quality handcrafted e-government multi-label dataset, which was constructed with the text annotation tool doccano and consists of 29,920 data points. Experimental results show that our method is effective under common multi-label evaluation metrics, achieving micro-F1 scores of 77.22%, 84.42%, and 87.52%, and macro-F1 scores of 77.68%, 73.37%, and 83.57% on the three datasets respectively, confirming that the classifier is robust. We conduct an ablation study to evaluate our label embedding method and attention mechanism. Moreover, a case study on our handcrafted e-government multi-label dataset verifies that the model integrates all types of semantic information of short messages, conditioned on the different labels, to achieve text classification.
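A minimal PyTorch sketch of the layer stack described above (the embedding sizes, label count, and dot-product form of the label-wise attention are illustrative assumptions, not the paper's exact configuration):

```python
# Minimal sketch of a CNN-BiLSTM-Attention multi-label classifier.
# Hyperparameters (vocab_size, dims, num_labels) are illustrative, not from the paper.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden=128, num_labels=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # CNN layer: local n-gram semantic features.
        self.conv = nn.Conv1d(embed_dim, hidden, kernel_size=3, padding=1)
        # BiLSTM layer: fuses contextual features with the local features.
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        # Label embeddings: one query vector per label ("label meaning").
        self.label_embed = nn.Parameter(torch.randn(num_labels, 2 * hidden))
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq, hidden)
        h, _ = self.bilstm(x)                          # (batch, seq, 2*hidden)
        # Label-wise attention: each label attends to its most relevant tokens.
        scores = torch.einsum('bsh,lh->bls', h, self.label_embed)
        attn = torch.softmax(scores, dim=-1)
        ctx = torch.einsum('bls,bsh->blh', attn, h)    # (batch, labels, 2*hidden)
        return self.out(ctx).squeeze(-1)               # one logit per label

model = CnnBiLstmAttention()
logits = model(torch.randint(0, 30000, (4, 50)))
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(4, 10))  # multi-label loss
```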

2.
Text classification or categorization is the process of automatically tagging a textual document with the most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. The multi-label version, however, is challenging. For Arabic, both tasks (especially the latter) become even more challenging in the absence of large, free, and rich Arabic datasets. We therefore introduce new rich and unbiased datasets for both single-label (SANAD) and multi-label (NADiA) Arabic text categorization. Both corpora are made freely available to the research community on Arabic computational linguistics. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate their effectiveness on SANAD and NADiA. A unique characteristic of our work, compared to existing ones, is that it requires no pre-processing phase and is fully based on deep learning models. In addition, we study the impact of word2vec embedding models on the performance of the classification tasks. Our experimental results show solid performance of all models on the SANAD corpus, with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and a top performance of 96.94%, achieved by attention-GRU. As for NADiA, attention-GRU achieved the highest overall accuracy of 88.68% for a maximum subset of 10 categories on the “Masrawy” dataset.

3.
As a well-known multi-label classification method, ML-KNN can be affected by uncertain knowledge in the samples. Rough set theory is an effective tool for analyzing data uncertainty, capable of identifying the samples prone to misclassification during learning. In this paper, a hybrid framework fusing rough sets with ML-KNN for multi-label learning is proposed. Its main idea is to characterize easily misclassified samples via rough sets and to measure the discernibility of attributes for such samples. First, a rough set model named NRFD_RS, based on neighborhood relations and fuzzy decisions, is proposed for multi-label data to find the heterogeneous sample pairs generated from the boundary regions of each label. Then, the weight of an attribute is defined by evaluating its discernibility with respect to those heterogeneous sample pairs. Finally, a weighted HEOM distance is constructed and plugged into ML-KNN. Comprehensive experiments on fourteen public multi-label data sets, including ten regular-scale and four larger-scale data sets, verify the effectiveness of the proposed framework relative to several state-of-the-art multi-label classification methods.
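The weighted HEOM distance at the core of the framework can be sketched as follows; the per-attribute weights, which the paper derives from rough-set discernibility analysis, are placeholders here:

```python
# Sketch of an attribute-weighted HEOM distance plugged into a kNN query.
# The weights would come from the rough-set discernibility analysis; here
# they are placeholder values.
import numpy as np

def weighted_heom(x, y, is_nominal, ranges, weights):
    """Heterogeneous Euclidean-Overlap Metric with per-attribute weights."""
    d = np.empty(len(x))
    for j, (a, b) in enumerate(zip(x, y)):
        if is_nominal[j]:
            d[j] = 0.0 if a == b else 1.0          # overlap metric
        else:
            d[j] = abs(a - b) / ranges[j]          # range-normalised difference
    return np.sqrt(np.sum(weights * d ** 2))

def knn_indices(query, data, k, **heom_args):
    dists = [weighted_heom(query, row, **heom_args) for row in data]
    return np.argsort(dists)[:k]                   # k nearest neighbours

data = np.array([[0, 1.0], [1, 2.0], [0, 3.5]])
args = dict(is_nominal=[True, False], ranges=np.array([1.0, 2.5]),
            weights=np.array([0.7, 1.3]))          # hypothetical rough-set weights
print(knn_indices(np.array([0, 1.5]), data, k=2, **args))
```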

4.
Multi-label classification (MLC) has attracted many researchers in the field of machine learning, as it has a straightforward problem statement with varied solution approaches. Multi-label classifiers predict multiple labels for a single instance. The problem becomes challenging as the number of features grows, especially when many features and labels depend on each other, requiring dimensionality reduction before any multi-label learning method is applied. This paper introduces a method named FS-MLC (Feature Selection for Multi-Label classification using Clustering in feature-space). It is a wrapper feature selection method that uses clustering to find the similarity among features, and example-based precision and recall as the metrics for feature ranking, to improve the performance of the associated classifier in terms of sample-based measures. First, clusters are created by treating features as instances; then, one feature from each cluster is selected as the representative of all features in that cluster. This reduces the number of features, since a single feature represents the multiple features within its cluster. The method requires neither parameter tuning nor a user-defined threshold for the number of features selected. Extensive experimentation is performed to evaluate the efficacy of the reduced feature sets on nine benchmark MLC datasets using twelve performance measures. The results show an impressive improvement in sample-based precision, recall, and F1-score while discarding 23%–93% of the features.
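A simplified sketch of the cluster-and-represent idea, assuming k-means over the transposed feature matrix; note that FS-MLC ranks candidate representatives by example-based precision and recall, whereas this sketch uses nearest-to-centroid as a stand-in:

```python
# Sketch of clustering-based feature selection: cluster the features
# (columns) and keep one representative per cluster. Picking the feature
# closest to its cluster centroid is a simplified stand-in for FS-MLC's
# precision/recall-based ranking.
import numpy as np
from sklearn.cluster import KMeans

def select_features(X, n_clusters):
    feats = X.T                                   # treat features as instances
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])  # representative feature
    return sorted(selected)

X = np.random.rand(100, 40)                       # 100 samples, 40 features
print(select_features(X, n_clusters=8))           # indices of the 8 kept features
```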

5.
Students use general web search engines as their primary research tool when trying to answer school-related questions. Although search engines serve the general population well, they may return results that are out of educational context. Social community question answering websites, another rising trend, are the second choice for students who try to get answers from peers online. We attempt to improve educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions, built with an ensemble method that combines several standard learning algorithms with retrieval-based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results, obtaining 83.5% accuracy. Although our work is based entirely on the Turkish language, the features could easily be mapped to other languages. To find out whether search engine ranking in the education domain can be improved using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking, based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, for varying query lengths, and on factoid versus non-factoid queries. We show that some of the methods significantly improve rankings in the education domain.

6.
In many important application domains, such as text categorization, scene classification, biomolecular analysis, and medical diagnosis, examples are naturally associated with more than one class label, giving rise to multi-label classification problems. This fact has led, in recent years, to a substantial amount of research in multi-label classification. To evaluate and compare multi-label classifiers, researchers have adapted evaluation measures from the single-label paradigm, such as Precision and Recall, and have also developed many measures specific to the multi-label paradigm, such as Hamming Loss and Subset Accuracy. However, these evaluation measures have been used arbitrarily in multi-label classification experiments, without an objective analysis of correlation or bias. This can lead to misleading conclusions, as the experimental results may appear to favor a specific behavior depending on the subset of measures chosen. Moreover, because different papers in the area employ distinct subsets of measures, it is difficult to compare results across papers. In this work, we provide a thorough analysis of multi-label evaluation measures, and we give concrete suggestions to help researchers make an informed decision when choosing evaluation measures for multi-label classification.
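For reference, the two multi-label-specific measures named above can be computed directly from their standard definitions:

```python
# The two multi-label measures named above, computed from their definitions.
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots predicted incorrectly, averaged over all slots."""
    return np.mean(Y_true != Y_pred)

def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose *entire* label set is predicted exactly."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(Y_true, Y_pred))     # 1 wrong slot out of 6 -> 0.1667
print(subset_accuracy(Y_true, Y_pred))  # 1 exact match out of 2 -> 0.5
```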

7.
This paper proposes a new scheme for ensuring data consistency in unstructured P2P networks where peers can subscribe to multiple content types (identified by labels) and are rapidly informed of content updates. The idea is based on a static tree structure, the Cluster-K+ tree, which maintains most of the structural information about peers and labels. A label denotes a set of replicated or co-related data in the network. The Cluster-K+ tree provides efficient retrieval, addition, deletion, and consistent updates of labels. Our proposed structure guarantees a short search response time of O(H + K), where H denotes the height of the tree and K the degree of an internal tree node. We present theoretical bounds for the worst-case performance. To verify the bounds, we also present experimental results obtained from a network simulation. The results demonstrate that the actual performance of our system is significantly better than the theoretical bounds.

8.
The goal of the study presented in this article is to investigate to what extent classifying a web page by a single genre matches the users’ perspective. The degree of agreement on a single genre label for a web page can help determine whether a different classification scheme, one that overrides single-genre labelling, is needed. My hypothesis is that a single genre label does not account for the users’ perspective. To test this hypothesis, I submitted a restricted number of web pages (25) to a large number of web users (135 subjects), asking them to assign only a single genre label to each page. Users could choose from a list of 21 genre labels, or select one of two ‘escape’ options, i.e. ‘Add a label’ and ‘I don’t know’. The rationale was to observe the level of agreement on a single genre label per web page and to draw conclusions about the appropriateness of limiting genre classification of web pages to a single label. Results show that users largely disagree on the label to be assigned to a web page.

9.
程雅倩, 黄玮, 金晓祥, 贾佳. 《情报科学》, 2022, 39(2): 155-161
[Purpose/Significance] Multi-label texts on we-media platforms are high-dimensional and imbalanced, which leads to poor classification performance; studying multi-label text classification methods for university library we-media platforms in the 5G environment is therefore of great significance for solving this problem. [Method/Process] We first preprocess the multi-label texts collected from university library we-media platforms in the 5G environment, including removing meaningless data, segmenting the texts into words, and removing stop words. We then apply an improved principal component analysis (PCA) method to reduce the dimensionality of the multi-label texts, and use a vector space model to balance the texts. Finally, taking the processed texts as input, we build classifiers with the AdaBoost and SVM algorithms to perform multi-label classification. [Result/Conclusion] Experimental results show that the proposed method reduces Hamming loss, raises the F1 score, achieves good multi-label classification performance at low time cost, and is reliable. [Innovation/Limitation] Because the dataset used in this study is not large, the test and validation results have certain limitations; in future research we expect to use richer databases to further improve and innovate on the proposed method.
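A sketch of this pipeline with off-the-shelf components (standard PCA stands in for the paper's improved PCA, and a one-vs-rest linear SVM performs the multi-label step; AdaBoost could be substituted in the same slot):

```python
# Sketch of the dimensionality-reduction + multi-label classification pipeline.
# Standard PCA stands in for the paper's improved PCA; a one-vs-rest linear
# SVM stands in for the AdaBoost/SVM classifiers.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X = np.random.rand(200, 500)                 # 200 documents, 500 text-feature dims
Y = np.random.randint(0, 2, (200, 5))        # 5 binary labels per document

clf = make_pipeline(PCA(n_components=50),    # dimensionality reduction
                    OneVsRestClassifier(LinearSVC()))
clf.fit(X, Y)
print(clf.predict(X[:3]))                    # multi-label predictions
```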

10.
黄静, 薛书田, 肖进. 《软科学》, 2017, (7): 131-134
Combining semi-supervised learning with the multi-classifier ensemble model Bagging, we construct a semi-supervised ensemble model based on Bagging for class-imbalanced settings (SSEBI), which exploits both labeled and unlabeled samples to improve model performance. The model comprises three stages: (1) selectively labeling a subset of samples from the unlabeled data set and training several base classifiers; (2) classifying the test-set samples with the trained base classifiers; (3) ensembling the classification results to obtain the final prediction. An empirical analysis on five customer credit scoring data sets demonstrates the effectiveness of the proposed SSEBI model.
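A minimal sketch of a semi-supervised bagging ensemble in the spirit of SSEBI, with logistic regression as the base learner and a confidence threshold for the selective-labeling step (both are illustrative assumptions):

```python
# Sketch of a semi-supervised bagging ensemble: each base learner is trained
# on a bootstrap of the labeled pool plus confidently self-labeled samples;
# predictions are majority-voted. The confidence rule (>= 0.9) is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ssebi_fit_predict(X_lab, y_lab, X_unlab, X_test, n_base=5, thresh=0.9):
    seed = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = seed.predict_proba(X_unlab)
    conf = proba.max(axis=1) >= thresh            # stage 1: selectively label
    X_aug = np.vstack([X_lab, X_unlab[conf]])
    y_aug = np.concatenate([y_lab, proba[conf].argmax(axis=1)])
    votes, rng = [], np.random.default_rng(0)
    for _ in range(n_base):                       # bagging over augmented pool
        idx = rng.integers(0, len(X_aug), len(X_aug))
        clf = LogisticRegression(max_iter=1000).fit(X_aug[idx], y_aug[idx])
        votes.append(clf.predict(X_test))         # stage 2: classify test set
    return np.round(np.mean(votes, axis=0))       # stage 3: majority vote

X_lab, y_lab = np.random.rand(40, 4), np.random.randint(0, 2, 40)
X_unlab, X_test = np.random.rand(200, 4), np.random.rand(10, 4)
print(ssebi_fit_predict(X_lab, y_lab, X_unlab, X_test))
```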

11.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise owing to the difficulty of making relevance judgments, among other reasons. To tackle this problem, we propose a novel approach to learning to rank based on a probabilistic graphical model. Considering that an observed label might be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true and observed labels given the features of documents. The graphical model distinguishes true labels from observed labels and is specially designed for ranking in information retrieval; it therefore helps learn a more accurate model from noisy training data. Experiments on a real web-search dataset show that the proposed approach significantly outperforms previous approaches.
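The core modelling idea, treating the observed label as a noisy emission of a latent true label, amounts to P(observed | x) = Σ_y P(observed | true = y) · P(true = y | x). A minimal numeric sketch (the flip probabilities and the feature-conditional distribution are illustrative):

```python
# Minimal numeric sketch of the noisy-label idea: the observed label is a
# noisy emission of a latent true label, so
#   P(observed | x) = sum_y P(observed | true = y) * P(true = y | x).
# The flip matrix and P(true | x) below are illustrative values.
import numpy as np

# P(observed_label | true_label): rows = true, cols = observed.
flip = np.array([[0.9, 0.1],      # true "irrelevant" judged "relevant" 10% of the time
                 [0.2, 0.8]])     # true "relevant" judged "irrelevant" 20% of the time

def observed_label_likelihood(p_true_given_x, observed):
    """Marginalise the latent true label out of the joint distribution."""
    return np.dot(p_true_given_x, flip[:, observed])

p_true = np.array([0.3, 0.7])     # P(true | features of one document)
print(observed_label_likelihood(p_true, observed=1))  # 0.3*0.1 + 0.7*0.8 = 0.59
```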

12.
Machine learning applications must continually utilize label information from the data stream to detect concept drift and adapt to dynamic behavior. Because obtaining labels is expensive, it is impractical to assume that the data stream is fully labeled, and much research on semi-supervised concept drift detection has therefore been proposed. Despite the large research effort in the literature, there is a lack of analysis relating the information resources required to the achievable concept drift detection accuracy. Hence, this paper aims to answer the unexplored research question of “How many labeled samples are required to detect concept drift accurately?” by proposing an analytical framework to analyze and estimate the information resources required for accurate detection. Specifically, the paper decomposes the distribution-based concept drift detection task into a learning task and a dissimilarity measurement task, which are analyzed independently. The analysis results are then correlated to estimate the number of labels required within a set of data samples to detect concept drift accurately. The quality of this estimate is evaluated empirically; the results suggest that the estimate is accurate when a large amount of information resources is provided. Additionally, estimation results for a state-of-the-art method on a benchmark data set are reported to show the applicability of the proposed framework in benchmarked environments. In general, the estimate from the proposed analytical framework can serve as guidance for designing systems with limited information resources. This paper also hopes to assist in identifying research gaps and inspiring new research ideas regarding the analysis of the amount of information resources required for accurate concept drift detection.

13.
Most previous work on feature selection has emphasized only the reduction of high dimensionality of the feature space. In cases where many features are highly redundant with each other, however, we must utilize other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization that does not rely on such complex dependence models. Our method strives to reduce redundancy between features while maintaining information gain when selecting features for text categorization. Empirical results on a number of datasets show that our method is more effective than the greedy feature selection method of Koller and Sahami [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], and than conventional information gain, which is commonly used for feature selection in text categorization. Moreover, with our feature selection method, conventional machine learning algorithms sometimes achieve further improvements over support vector machines, which are known to give the best classification accuracy.
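For reference, the information gain of a binary term follows the standard definition IG(t) = H(C) − H(C | t); the sketch below computes it from class counts (the paper's additional divergence-based redundancy term is not reproduced):

```python
# Information gain of a binary term for text categorization, from the usual
# definition IG(t) = H(C) - H(C | t). The paper's divergence-based redundancy
# term between features is not reproduced here.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(docs_with_t, docs_without_t):
    """Each argument: array of class counts among docs containing / lacking term t."""
    n_t, n_not = docs_with_t.sum(), docs_without_t.sum()
    n = n_t + n_not
    h_c = entropy((docs_with_t + docs_without_t) / n)          # H(C)
    h_c_given_t = (n_t / n) * entropy(docs_with_t / n_t) \
                + (n_not / n) * entropy(docs_without_t / n_not)  # H(C | t)
    return h_c - h_c_given_t

# 2 classes: the term appears in 40 docs of class 0 and 10 of class 1,
# and is absent from 20 docs of class 0 and 30 of class 1.
print(information_gain(np.array([40, 10]), np.array([20, 30])))  # ~0.125 bits
```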

14.
15.
Information residing in the multiple modalities (e.g., text, image) of social media posts can jointly provide more comprehensive and clearer insights into an ongoing emergency. To identify information valuable for humanitarian aid within noisy multimodal data, we first clarify the categories of humanitarian information and define a multi-label multimodal humanitarian information identification task, which can accommodate the label inconsistency caused by modality independence while maintaining the correlation between modalities. We propose a Multimodal Humanitarian Information Identification Model that simultaneously captures the Correlation and Independence between modalities (CIMHIM). A tailor-made dataset containing 4,383 annotated text-image pairs was built to evaluate the effectiveness of our model. The experimental results show that CIMHIM outperforms both unimodal and multimodal baseline methods by at least 0.019 in macro-F1 and 0.022 in accuracy. The combination of OCR text, object-level features, and a decision rule based on label correlations enhances the overall performance of CIMHIM. Additional experiments on a similar dataset (CrisisMMD) also demonstrate the robustness of CIMHIM. The task, model, and dataset proposed in this study contribute to the practice of leveraging multimodal social media resources to support effective emergency response.

16.
Most existing research on applying machine learning techniques to document summarization explores either classification models or learning-to-rank models. This paper presents our recent study on applying a different kind of learning model, namely regression models, to query-focused multi-document summarization. We use Support Vector Regression (SVR) to estimate the importance of a sentence in a document set to be summarized through a set of pre-defined features. To learn the regression models, we propose several methods for constructing “pseudo” training data by assigning each sentence a “nearly true” importance score calculated from the human summaries provided for the corresponding document set. A series of evaluations on the DUC data sets is conducted to examine the efficiency and robustness of the proposed approaches. Compared with classification models and ranking models, regression models are consistently preferable.
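A minimal sketch of the regression approach: fit SVR on per-sentence features against the "nearly true" importance scores, then rank new sentences by predicted importance (the two features used here are illustrative, not the paper's feature set):

```python
# Sketch of the regression approach: train Support Vector Regression to map
# sentence features to a "nearly true" importance score, then rank sentences.
# The two features (query similarity, sentence position) are illustrative.
import numpy as np
from sklearn.svm import SVR

# Pseudo training data: per-sentence features and importance scores that
# would be derived from overlap with the human reference summaries.
X_train = np.array([[0.9, 1], [0.2, 8], [0.5, 3], [0.1, 12]])
y_train = np.array([0.8, 0.1, 0.5, 0.05])     # "nearly true" importance scores

svr = SVR(kernel='rbf').fit(X_train, y_train)

X_new = np.array([[0.7, 2], [0.3, 9]])        # sentences of a new document set
scores = svr.predict(X_new)
ranking = np.argsort(-scores)                 # most important sentences first
print(scores, ranking)
```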

17.
18.
Hashing has been an emerging topic and has recently attracted widespread attention in multi-modal similarity search applications. However, most existing approaches rely on relaxation schemes to generate binary codes, leading to large quantization errors. In addition, many existing approaches embed labels into the pairwise similarity matrix, which incurs expensive time and space costs and loses category information. To address these issues, we propose Efficient Discrete Matrix factorization Hashing (EDMH). Specifically, EDMH first learns latent subspaces for the individual modalities through a matrix factorization strategy, which preserves the semantic structure of each modality. In particular, we develop a semantic label offset embedding learning strategy that improves the stability of label embedding regression. Furthermore, we design an efficient optimization scheme that generates compact binary codes discretely. Finally, we present two efficient learning strategies, EDMH-L and EDMH-S, to pursue high-quality hash functions. Extensive experiments on various widely-used databases verify that the proposed algorithms deliver strong performance and outperform several state-of-the-art approaches, with average improvements of 2.50% (Wiki), 2.66% (MIRFlickr), and 2.25% (NUS-WIDE) over the best available results.
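A toy sketch of the matrix-factorization-hashing idea: factorize one modality's feature matrix into a low-rank latent subspace and binarize it into codes. Note that the sign-thresholding step below is exactly the kind of relaxation EDMH replaces with its discrete optimization scheme; it serves only as a simple stand-in:

```python
# Toy sketch of matrix-factorization hashing: learn a low-rank latent
# representation for one modality and binarise it into hash codes.
# EDMH's discrete optimisation and label-offset embedding are not
# reproduced; plain SVD + sign thresholding (a relaxation) stands in.
import numpy as np

def mf_hash_codes(X, n_bits):
    """X: (n_samples, n_features) for one modality -> codes in {-1, +1}."""
    X = X - X.mean(axis=0)                        # centre so sign() splits evenly
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    latent = U[:, :n_bits] * S[:n_bits]           # low-rank latent subspace
    return np.sign(latent)                        # binarisation step

def hamming_rank(query_code, db_codes):
    dists = np.sum(query_code != db_codes, axis=1)
    return np.argsort(dists)                      # nearest codes first

X_img = np.random.rand(100, 64)                   # image-modality features
codes = mf_hash_codes(X_img, n_bits=16)
print(hamming_rank(codes[0], codes)[:5])          # retrieve similar items
```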

19.
Fact verification aims to retrieve relevant evidence from a knowledge base, e.g., Wikipedia, to verify given claims. Existing methods consider only sentence-level semantics for evidence representations, typically neglecting the importance of fine-grained features in evidence-related sentences. In addition, the interpretability of the reasoning process has not been well studied in the field of fact verification. To address these issues, we propose an entity-graph based reasoning method for fact verification, abbreviated RoEG, which generates fine-grained evidence features at the entity level and models human reasoning paths over an entity graph. Specifically, to capture the semantic relations of retrieved evidence, RoEG introduces entities as nodes and constructs the graph's edges based on three linking strategies. RoEG then utilizes a selection gate to constrain information propagation within the sub-graph of relevant entities and applies a graph neural network to propagate entity features for reasoning. Finally, RoEG employs an attention aggregator to gather entity information for label prediction. Experimental results on FEVER, a large-scale benchmark dataset, demonstrate the effectiveness of our proposal, which beats competitive baselines in terms of label accuracy and FEVER Score. In particular, on multiple-evidence fact verification, RoEG yields improvements of 5.48% in label accuracy and 4.35% in FEVER Score over the state-of-the-art baseline. In addition, RoEG performs better when more entities are involved in verification.

20.
Identifying the petition expectation behind a request for government response plays an important role in government administrative service. Although some petition platforms allow citizens to label the expectation when they submit e-petitions, misunderstanding and mis-selection of petition labels still necessitate manual classification. Automatic petition expectation identification faces the challenges of poor context information, heavy noise, and the casual syntactic structure of petition texts. In this paper we propose a novel deep reinforcement learning based method, named PecidRL, for correcting and identifying petition expectations (citizens’ demands for the level of government response). We collect a dataset of 237,042 petitions from Message Board for Leaders, the largest official petition platform in China. First, we introduce a deep reinforcement learning framework to automatically correct mislabeled and ambiguous petition labels. Then, multi-view textual features, including word-level and document-level semantic features, sentiment features, and different textual graph representations, are extracted and integrated to enrich the auxiliary information. Furthermore, based on the corrected petitions, 19 novel petition expectation identification models are constructed by extending 11 popular machine learning models. Finally, comprehensive comparison and evaluation are conducted to select the best-performing petition expectation identification model. After correction by PecidRL, every metric across all extended identification models improves by an average of 8.3%, with the highest increase reaching 14.2%. The optimal model is Peti-SVM-bert, with the highest accuracy of 93.66%. We also use PecidRL to analyze the variation of petition expectation labels in the dataset. We find that 16.9% of e-petitioners tend to exaggerate the urgency of their petitions so that the government pays closer attention to their appeals, while the urgency of 4.4% of petitions is underestimated. This study has substantial academic and practical value for improving government efficiency. Additionally, a web server has been developed to assist government administrators and other researchers; it can be accessed at http://www.csbg-jlu.info/PecidRL/.
