Similar literature
Found 20 similar documents; search took 359 ms
1.
Graph Convolutional Networks (GCNs) have been established as a fundamental approach for representation learning on graphs, based on convolution operations on non-Euclidean domains defined by graph-structured data. GCNs and their variants have achieved state-of-the-art results on classification tasks, especially in semi-supervised learning scenarios. A central challenge in semi-supervised classification is how to exploit as much as possible of the useful information encoded in the unlabeled data. In this paper, we address this issue through a novel self-training approach for improving the accuracy of GCNs on semi-supervised classification tasks. A margin score is used in a rank-based model to identify the most confident sample predictions. These predictions are then used as an expanded labeled set in a second-stage training step. Our approach is suitable for different GCN models. Moreover, we also propose a rank aggregation of labeled sets obtained by different GCN models. The experimental evaluation considers four GCN variations and traditional benchmarks extensively used in the literature. Significant accuracy gains were achieved for all evaluated models, reaching results comparable or superior to the state of the art. The best results were achieved for rank aggregation self-training on combinations of the four GCN models.
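
The selection step described in this abstract can be illustrated with a minimal sketch. The abstract does not define the margin score, so the code assumes the common definition (largest class probability minus the second largest); the function names `margin_score` and `select_confident` and the toy predictions are hypothetical and are not taken from the paper.

```python
def margin_score(probs):
    """Assumed margin: gap between the two largest class probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def select_confident(pred_probs, k):
    """Rank unlabeled nodes by margin (descending) and return the k most
    confident ones as (node_id, pseudo_label) pairs."""
    ranked = sorted(pred_probs.items(),
                    key=lambda item: margin_score(item[1]),
                    reverse=True)
    return [(node, max(range(len(p)), key=p.__getitem__))
            for node, p in ranked[:k]]


# Toy class-probability outputs for three unlabeled nodes (illustrative only).
preds = {
    "a": [0.90, 0.05, 0.05],
    "b": [0.40, 0.35, 0.25],
    "c": [0.10, 0.80, 0.10],
}
pseudo = select_confident(preds, 2)  # candidates for the expanded labeled set
```

In a self-training loop of the kind the abstract describes, the selected pairs would be added to the labeled set before a second training stage; the training itself is outside the scope of this sketch.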

2.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are plentiful and easily collected, labeled documents are difficult to generate because the labeling must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method starts a text classification task with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results show that the proposed method achieves reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.

3.
The research field of crisis informatics examines, among other things, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant ones, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process, and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision and 80.4%/87.5% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset while requiring only a quarter of the labeled data needed by the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, a substantial improvement could still be determined.

4.
Semi-supervised document retrieval   Cited by: 2 (self-citations: 0, citations by others: 2)
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, referred to as SSRank, aims to combine the advantages of the traditional Information Retrieval (IR) methods and the recently proposed supervised learning methods for IR. The advantages include the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a neural network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage unlabeled data in learning.

5.
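黄静, 薛书田, 肖进. 《软科学》 (Soft Science), 2017, (7): 131-134
Combining semi-supervised learning with the multi-classifier ensemble model Bagging, this paper builds a Bagging-based semi-supervised ensemble model for imbalanced class distributions (SSEBI), which uses both labeled and unlabeled samples to improve model performance. The model consists of three stages: (1) selectively labeling a subset of samples from the unlabeled dataset and training several base classifiers; (2) classifying the test-set samples with the trained base classifiers; (3) combining the classification results to obtain the final classification. An empirical analysis on five customer credit assessment datasets demonstrates the effectiveness of the proposed SSEBI model.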

6.
Dialectal Arabic (DA) refers to varieties of everyday spoken languages in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Language Processing (NLP) for standard Arabic, namely Modern Standard Arabic (MSA), has witnessed remarkable progress, research efforts on DA are rather limited. This is due to numerous challenges, such as the scarcity of labeled data as well as the nature and structure of DA. While some recent works have reached decent results on several DA sentence classification tasks, other complex tasks, such as sequence labeling, still suffer from weak performance when it comes to DA varieties with either a limited amount of labeled data or unlabeled data only. Besides, it has been shown that zero-shot transfer learning from models trained on MSA does not perform well on DA. In this paper, we introduce AdaSL, a new unsupervised domain adaptation framework for Arabic multi-dialectal sequence labeling, leveraging unlabeled DA data, labeled MSA data, and existing multilingual and Arabic Pre-trained Language Models (PLMs). The proposed framework relies on four key components: (1) domain adaptive fine-tuning of multilingual/MSA language models on unlabeled DA data, (2) sub-word embedding pooling, (3) iterative self-training on unlabeled DA data, and (4) iterative DA and MSA distribution alignment. We evaluate our framework on multi-dialectal Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The overall results show that zero-shot transfer learning using our proposed framework boosts the performance of the multilingual PLMs by 40.87% in macro-F1 score for the NER task and by 6.95% in accuracy for the POS tagging task. For the Arabic PLMs, our proposed framework increases performance by 16.18% macro-F1 for the NER task and 2.22% accuracy for the POS tagging task, thus achieving new state-of-the-art zero-shot transfer learning performance for Arabic multi-dialectal sequence labeling.

7.
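To address the large amount of labeled data required by traditional deep learning algorithms for classifying steel plate surface defect images, an efficient classification method based on active learning is proposed. The method comprises a lightweight convolutional neural network and an uncertainty-based active learning sample selection strategy. The network uses a simplified convolutional base for feature extraction and replaces the hidden layers of the traditional densely connected classifier with a global pooling layer to reduce overfitting. To better measure the model's uncertainty about the class of an unlabeled image, the unlabeled images are first passed through the model trained on the labeled images, giving for each unlabeled sample a probability distribution over classes (PDC); the model is then used to predict the labeled samples, giving the average PDC for each label. The KL-divergence between the two distributions serves as the uncertainty measure for selecting unlabeled images for manual annotation. In comparative experiments on the open-source NEU-CLS defect dataset, the method reaches 97% accuracy with 44% of the labeled data, greatly reducing annotation cost.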
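
The uncertainty measure in this abstract can be sketched as follows. The abstract does not say which average PDC an unlabeled sample is compared against, nor the direction of the divergence, so the code makes two labeled assumptions: each unlabeled sample is compared with the average PDC of its predicted class, and a larger KL value is read as higher uncertainty. All function names and the toy numbers are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def mean_pdc_per_class(labeled_pdcs, labels, n_classes):
    """Average predicted distribution for each true label.
    Assumes every class has at least one labeled sample."""
    sums = [[0.0] * n_classes for _ in range(n_classes)]
    counts = [0] * n_classes
    for pdc, y in zip(labeled_pdcs, labels):
        counts[y] += 1
        for j, value in enumerate(pdc):
            sums[y][j] += value
    return [[v / counts[c] for v in sums[c]] for c in range(n_classes)]


def pick_for_annotation(unlabeled_pdcs, class_means, budget):
    """Return indices of the `budget` samples whose PDC diverges most from
    the average PDC of their predicted class (assumption: larger KL means
    more uncertain)."""
    def uncertainty(pdc):
        predicted = max(range(len(pdc)), key=pdc.__getitem__)
        return kl_divergence(pdc, class_means[predicted])

    order = sorted(range(len(unlabeled_pdcs)),
                   key=lambda i: uncertainty(unlabeled_pdcs[i]),
                   reverse=True)
    return order[:budget]


# Toy two-class example (illustrative numbers only).
class_means = mean_pdc_per_class(
    [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]], [0, 0, 1], n_classes=2)
unlabeled = [[0.86, 0.14], [0.55, 0.45], [0.25, 0.75]]
to_label = pick_for_annotation(unlabeled, class_means, budget=1)
```

Here the second unlabeled sample is selected because its near-uniform prediction is furthest from the typical prediction for its class.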

8.
Semi-supervised anomaly detection methods leverage a few anomaly examples to yield drastically improved performance compared to unsupervised models. However, they still suffer from two limitations: 1) unlabeled anomalies (i.e., anomaly contamination) may mislead the learning process when all the unlabeled data are employed as inliers for model training; 2) only discrete supervision information (such as binary or ordinal data labels) is exploited, which leads to suboptimal learning of anomaly scores that essentially take on a continuous distribution. Therefore, this paper proposes a novel semi-supervised anomaly detection method, which devises contamination-resilient continuous supervisory signals. Specifically, we propose a mass interpolation method to diffuse the abnormality of labeled anomalies, thereby creating new data samples labeled with continuous degrees of abnormality. Meanwhile, the contaminated area can be covered by new data samples generated via combinations of data with correct labels. A feature learning-based objective is added to serve as an optimization constraint to regularize the network and further enhance the robustness w.r.t. anomaly contamination. Extensive experiments on 11 real-world datasets show that our approach significantly outperforms state-of-the-art competitors by 20%–30% in AUC-PR and obtains more robust and superior performance in settings with different anomaly contamination levels and varying numbers of labeled anomalies.

9.
Ranking aggregation is the task of combining multiple ranking lists given by several experts or simple rankers to obtain a hopefully better ranking. It is applicable in several fields such as meta search and collaborative filtering. Most existing work is under an unsupervised framework. In these methods, performance is usually limited, especially in unreliable cases, since labeled information is not involved. In this paper, we propose a semi-supervised ranking aggregation method, in which preference constraints on several item pairs are given. In our method, the aggregation function is learned based on the ordering agreement of different rankers. The ranking scores assigned by this ranking function on the labeled data should be consistent with the given pairwise order constraints, while the ranking scores on the unlabeled data obey the intrinsic manifold structure of the ranked items. Experimental results on toy data and the OHSUMED data are presented to illustrate the validity of our method.

10.
This study aims to explore the relationships between user interaction and digital library (DL) evaluation. User interaction is a multi-dimensional construct and is treated in this study as having three dimensions: user interaction with information resources, with the interface, and with tasks. DL evaluation is considered from the user's perspective and defined as users’ perception of DL performance from different perspectives, including the support of the DL's interaction design for user interaction (labeled as interaction-design-based (IDB) evaluation), the support of task completion (labeled as task-based evaluation), and a DL's overall performance (labeled as overall evaluation). An experiment with 48 participants was conducted using the China National Knowledge Infrastructure (CNKI (http://cnki.net/), the most widely used digital library in China). Participants searched for four simulated work tasks and one real work task during the experiment, evaluating their interaction with information resources, interface, and tasks, and DL performance from different perspectives before or after the search. Correlation analysis and stepwise regression analysis were conducted to examine the relationships. The results indicate that a list of factors related to different dimensions of user interaction can significantly predict or be correlated with users’ evaluation of DL performance from different perspectives, including appropriateness, rich and valid links, reasonable page layout, salience of topics, search task difficulty, well-organized web site, ease of learning, accessibility, usefulness, familiarity with task procedure, etc. These factors surface as the most critical criteria for DL evaluation. Based on the results, an integrated DL evaluation framework is developed. The study adds new knowledge about how tasks affect DL evaluation. It has implications for improving the efficiency of DL evaluation and helping DL developers design DLs to better support users’ interaction, task completion, and their overall experience with DLs.

11.
This research presents an enhanced approach for Aspect-Based Sentiment Analysis (ABSA) of hotels’ Arabic reviews using supervised machine learning. The proposed approach follows state-of-the-art research in training a set of classifiers with morphological, syntactic, and semantic features to address three research tasks: (a) T1: Aspect Category Identification, (b) T2: Opinion Target Expression (OTE) Extraction, and (c) T3: Sentiment Polarity Identification. The classifiers employed include Naïve Bayes, Bayes Networks, Decision Tree, K-Nearest Neighbor (K-NN), and Support-Vector Machine (SVM). The approach was evaluated using a reference dataset based on the Semantic Evaluation 2016 workshop (SemEval-2016: Task-5). Results show that the supervised learning approach outperforms related work evaluated using the same dataset. More precisely, evaluation results show that all classifiers in the proposed approach outperform the baseline approach, and the overall enhancement for the best performing classifier (SVM) is around 53% for T1, around 59% for T2, and around 19% for T3.

12.
13.
Search Engine for South-East Europe (SE4SEE) is a socio-cultural search engine running on the grid infrastructure. It offers a personalized, on-demand, country-specific, category-based Web search facility. The main goal of SE4SEE is to attack the page freshness problem by performing the search on the original pages residing on the Web, rather than on previously fetched copies as traditional search engines do. SE4SEE also aims to obtain high download rates in Web crawling by making use of the geographically distributed nature of the grid. In this work, we present the architectural design issues and implementation details of this search engine. We conduct various experiments to illustrate performance results obtained on a grid infrastructure and justify the use of the search strategy employed in SE4SEE.

14.
Despite a number of studies looking at Web experience and Web searching tactics and behaviours, the specific relationships between experience and cognitive search strategies have not been widely researched. This study investigates how the cognitive search strategies of 80 participants might vary with Web experience as they engaged in two researcher-defined tasks and two participant-defined information seeking tasks. Each of the two researcher-defined tasks and participant-defined tasks included a directed search task and a general-purpose browsing task. While there were almost no significant performance differences between experience levels on any of the four tasks, there were significant differences in the use of cognitive search strategies. Participants with higher levels of Web experience were more likely to use “Parallel player”, “Parallel hub-and-spoke”, “Known address search domain” and “Known address” strategies, whereas participants with lower levels of Web experience were more likely to use “Virtual tourist”, “Link-dependent”, “To-the-point”, “Sequential player”, “Search engine narrowing”, and “Broad first” strategies. The patterns of use and differences between researcher-defined and participant-defined tasks and between directed search tasks and general-purpose browsing tasks are also discussed, although the distribution of search strategies by Web experience was not statistically significant for each individual task.

15.
16.
Nowadays, data scientists are capable of manipulating and extracting complex information from time series data, given the current diversity of tools at their disposal. However, with the plethora of tools that target data exploration and pattern search, an extensive amount of time may be required to develop methods that correspond to the data scientist's reasoning in order to solve their queries. The development of new methods, tightly related to the reasoning and visual analysis of time series data, is of great relevance to reducing the complexity and improving the productivity of pattern and query search tasks. In this work, we propose a novel tool capable of exploring time series data for pattern and query search tasks in a set of 3 symbolic steps: Pre-Processing, Symbolic Connotation and Search. The framework is called SSTS (Symbolic Search in Time Series) and uses regular expression queries to search for the desired patterns in a symbolic representation of the signal. By adopting a set of symbolic methods, this approach aims to increase the expressiveness in solving standard pattern and query tasks, enabling the creation of queries more closely related to the reasoning and visual analysis of the signal. We demonstrate the tool's effectiveness by presenting 9 examples with several types of queries on time series. The SSTS queries were compared with standard code developed in Python, in terms of cognitive effort, vocabulary required, code length, volume, interpretation and difficulty metrics based on the Halstead complexity measures. The results demonstrate that this methodology is a valid approach and delivers a new abstraction layer for data analysis of time series.
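
The idea of searching a symbolic representation of a signal with regular expressions can be sketched as follows. This is not the SSTS implementation: the symbol alphabet (`p` for rising, `n` for falling, `z` for flat), the function names, and the toy signal are assumptions made for illustration, and the Pre-Processing step is omitted.

```python
import re


def connotate(signal, tol=0.0):
    """Map each consecutive difference to a symbol:
    'p' rising, 'n' falling, 'z' flat (within tolerance)."""
    symbols = []
    for prev, cur in zip(signal, signal[1:]):
        diff = cur - prev
        symbols.append("p" if diff > tol else "n" if diff < -tol else "z")
    return "".join(symbols)


def search(signal, pattern, tol=0.0):
    """Return (start, end) spans of regex matches in the symbol string.
    Symbol i describes the step from sample i to sample i + 1, so a match
    (start, end) covers samples start..end of the original signal."""
    text = connotate(signal, tol)
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]


signal = [0, 1, 2, 1, 0, 0, 1, 3, 2]
symbols = connotate(signal)          # one symbol per step
peaks = search(signal, r"p+n+")      # a rise followed by a fall
```

A query such as `p+n+` reads close to the visual description "goes up, then comes down", which is the kind of expressiveness the abstract attributes to symbolic search.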

17.
Search systems are limited by their inability to distinguish between information that is on topic and information that is useful, i.e. suitable and applicable to the tasks at hand. This paper presents the results of two studies that examine a possible approach to identifying more useful documents through the relationships between searchers’ tasks and the document genres in the collection. A questionnaire and an experimental user study, conducted in two domains, provide evidence that perceptions of usefulness are dependent upon information task type, document genre, and the relationship between these two factors. Expertise is also found to have an effect on usefulness. These results further our understanding of the role of task and genre in interactive information retrieval.

18.
In this paper we present novel ensemble classifier architectures and investigate their influence on offline cursive character recognition. Cursive characters are represented by feature sets that portray different aspects of character images for recognition purposes. The recognition accuracy can be improved by training an ensemble of classifiers on the feature sets. Given the feature sets and the base classifiers, we have developed multiple ensemble classifier compositions under four architectures. The first three architectures are based on the use of multiple feature sets, whereas the fourth architecture is based on the use of a unique feature set. The Type-1 architecture is composed of homogeneous base classifiers and the Type-2 architecture is constructed using heterogeneous base classifiers. The Type-3 architecture is based on hierarchical fusion of decisions. In the Type-4 architecture, a unique feature set is learned by a set of homogeneous base classifiers with different learning parameters. The experimental results demonstrate that the recognition accuracy achieved using the proposed ensemble classifier (with the best composition of base classifiers and feature sets) is better than the existing recognition accuracies for offline cursive character recognition.

19.
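This paper addresses the "University Library Program" launched by Baidu Scholar (百度学术搜索), which aims at knowledge discovery and at connecting users with library information services. Using the Chaoxing Discovery System (超星发现系统) as a basis for comparison, it empirically analyzes Baidu Scholar in terms of indexed data, search functions, ranking of search results, data mining services, citation and export of bibliographic records, and full-text access, in order to assess the Chinese academic resource search and service capabilities of Baidu Scholar as the first Internet academic platform in China with an index on the order of hundreds of millions of records. The comparison shows that Baidu Scholar needs to draw on the Chaoxing Discovery System for further optimization and improvement. The analysis is intended as a reference for research on domestic academic platforms and for the construction and performance evaluation of academic resource search platforms.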

20.
Text classification is an important research topic in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied to this task. However, in existing graph-based models, text graphs constructed by rules are not real graph data and introduce massive noise. More importantly, with a fixed corpus-level graph structure, these models cannot sufficiently exploit the labeled and unlabeled information of nodes. Meanwhile, contrastive learning has been developed as an effective method in the graph domain to fully utilize the information of nodes. Therefore, we propose a new graph-based model for text classification named CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. First, we explore word co-occurrence and document-word relationships to construct a text graph. Then, we design an adaptive augmentation strategy for the noisy text graph to generate two contrastive views that effectively address the noise problem and preserve essential structure. Specifically, we design noise-based and centrality-based augmentation strategies on the topological structure of the text graph to disturb the unimportant connections and thus highlight the relatively important edges. For the labeled nodes, we take the nodes with the same label as multiple positive samples and assign them to the anchor node, while we employ consistency training on unlabeled nodes to constrain model predictions. Finally, to reduce the resource consumption of contrastive learning, we adopt a random sampling method to select some nodes for calculating the contrastive loss. The experimental results on several benchmark datasets demonstrate the effectiveness of CGA2TC on the text classification task.
