Similar Articles
20 similar articles found.
1.
Semi-supervised document retrieval   (Cited by: 2; self-citations: 0; citations by others: 2)
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, referred to as SSRank, aims to combine the advantages of traditional Information Retrieval (IR) methods and the recently proposed supervised learning methods for IR: the need for only a limited amount of labeled data and a rich model representation. To this end, the method adopts a semi-supervised learning framework for ranking model construction. Specifically, given a small number of labeled documents for some queries, the method labels the unlabeled documents for those queries and then uses all the labeled data to train a machine learning model (in our case, a Neural Network). In the data labeling it also makes use of a traditional IR model (in our case, BM25), and a stopping criterion grounded in machine learning theory governs the labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that, given the same amount of labeled data, SSRank consistently, and in most cases significantly, outperforms the baseline unsupervised and supervised learning methods, because it effectively leverages unlabeled data during learning.
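The loop described above — pseudo-label unlabeled documents with a traditional IR model, then train a neural ranker on the union of labeled and pseudo-labeled data — can be sketched as follows. This is a minimal illustration, not the paper's exact SSRank: the score thresholds, feature matrices, and MLP ranker are assumptions, and the confidence band stands in for the paper's theoretical stopping criterion.

```python
# Minimal sketch: BM25-based pseudo-labeling plus a neural ranker.
import numpy as np
from sklearn.neural_network import MLPRegressor

def pseudo_label(bm25_scores, low=0.2, high=0.8):
    """Map normalized BM25 scores to relevance pseudo-labels.
    Scores in the ambiguous middle band stay unlabeled (NaN)."""
    s = (bm25_scores - bm25_scores.min()) / (np.ptp(bm25_scores) + 1e-9)
    labels = np.full(len(s), np.nan)
    labels[s >= high] = 1.0   # confidently relevant
    labels[s <= low] = 0.0    # confidently non-relevant
    return labels

def train_ssrank(X_labeled, y_labeled, X_unlabeled, bm25_scores):
    y_pseudo = pseudo_label(bm25_scores)
    keep = ~np.isnan(y_pseudo)                      # stand-in stopping rule:
    X = np.vstack([X_labeled, X_unlabeled[keep]])   # keep only confident labels
    y = np.concatenate([y_labeled, y_pseudo[keep]])
    ranker = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    ranker.fit(X, y)
    return ranker
```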

2.
To address the problem that traditional deep learning algorithms for classifying steel plate surface defect images require large amounts of labeled data, this paper proposes an efficient classification method based on active learning. The method consists of a lightweight convolutional neural network and an uncertainty-based active learning sample selection strategy. The network uses a simplified convolutional base for feature extraction and replaces the hidden layers of the conventional densely connected classifier with a global pooling layer to mitigate overfitting. To better measure the model's uncertainty about the class of an unlabeled image, each unlabeled sample is first fed into a model trained on the labeled samples to obtain its probability distribution over classes (PDC); the same model is then used to predict the labeled samples and compute the average PDC for each class. The KL divergence between these two kinds of distributions serves as the uncertainty score for selecting unlabeled images for manual annotation. Comparative experiments on the open-source NEU-CLS defect dataset show that the method reaches 97% accuracy with only 44% of the data labeled, greatly reducing annotation cost.
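The selection step might look roughly like the sketch below. Treating "far from every class-average PDC" as uncertain is one plausible reading of the KL-divergence criterion, not the paper's exact formulation; the softmax outputs and annotation budget are placeholders.

```python
# Rough sketch of KL-divergence-based active selection over PDCs.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def class_average_pdc(probs_labeled, y_labeled, n_classes):
    """Average predicted distribution over classes (PDC) per true class."""
    return np.stack([probs_labeled[y_labeled == c].mean(axis=0)
                     for c in range(n_classes)])

def select_for_annotation(probs_unlabeled, avg_pdc, budget):
    """Rank unlabeled samples by KL divergence to the nearest class-average
    PDC; the most uncertain (farthest) ones go to manual labeling."""
    scores = [min(entropy(p + 1e-12, q + 1e-12) for q in avg_pdc)
              for p in probs_unlabeled]
    return np.argsort(scores)[-budget:]
```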

3.
The advent of connected devices and the omnipresence of the Internet have paved the way for intruders to attack networks, leading to cyber-attacks, financial loss, information theft in healthcare, and cyber war. Hence, network security analytics has become an important area of concern and has of late gained intensive attention among researchers, specifically in the domain of network anomaly detection, which is considered crucial for network security. However, preliminary investigations have revealed that existing approaches to detecting anomalies in networks are not effective enough, particularly in real time, mainly because of the massive volumes of data amassed through connected devices. It is therefore crucial to propose a framework that effectively handles real-time big data processing and detects anomalies in networks. In this regard, this paper addresses the issue of detecting anomalies in real time by surveying the state-of-the-art real-time big data processing technologies related to anomaly detection and the vital characteristics of the associated machine learning algorithms. The paper begins by explaining the essential contexts and taxonomy of real-time big data processing, anomaly detection, and machine learning algorithms, followed by a review of big data processing technologies. Finally, the identified research challenges of real-time big data processing in anomaly detection are discussed.

4.
Stance detection is the task of distinguishing whether the author of a text supports, opposes, or maintains a neutral stance towards a given target. In most real-world scenarios, stance detection needs to work in a zero-shot manner, i.e., predicting stances for unseen targets without labeled data. One critical challenge of zero-shot stance detection is the absence of contextual information on the targets. Current works mostly concentrate on introducing external knowledge to supplement information about targets, but the noisy schema-linking process hinders their performance in practice. To combat this issue, we argue that previous studies have ignored the extensive target-related information residing in the unlabeled data during the training phase, and propose a simple yet efficient Multi-Perspective Contrastive Learning Framework for zero-shot stance detection. Our framework leverages information not only from labeled data but also from extensive unlabeled data. To this end, we design target-oriented contrastive learning and label-oriented contrastive learning to capture more comprehensive target representations and more distinguishable stance features. We conduct extensive experiments on three widely adopted datasets (from 4870 to 33,090 instances), namely SemEval-2016, WT-WT, and VAST. Our framework achieves 53.6%, 77.1%, and 72.4% macro-average F1 scores on these datasets, showing 2.71% and 0.25% improvements over state-of-the-art baselines on SemEval-2016 and WT-WT, and comparable results on the more challenging VAST dataset.

5.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have a notable problem: they require a large number of labeled training documents for accurate learning. While unlabeled documents are plentiful and easily collected, labeled documents are difficult to generate because labeling must be done by humans. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method starts the text classification task with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results show that the proposed method achieves reasonably useful performance compared to a supervised method. Using the proposed method, building text classification systems becomes significantly faster and less expensive.
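The bootstrapping idea — seed each category with documents containing its title word, then iteratively grow the training set with the classifier's most confident predictions — can be sketched as follows. The TF-IDF features, Naive Bayes learner, and growth schedule are illustrative assumptions; the paper's feature projection step is not reproduced.

```python
# Minimal bootstrapping sketch: title-word seeding + confidence-based growth.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def bootstrap_classifier(docs, title_words, rounds=3, grow_per_round=50):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    # Seed labeling: a document containing a category's title word gets that label.
    labels = {i: c for i, d in enumerate(docs)
              for c, w in enumerate(title_words) if w in d.split()}
    for _ in range(rounds):
        idx = list(labels)
        clf = MultinomialNB().fit(X[idx], [labels[i] for i in idx])
        proba = clf.predict_proba(X)
        conf = proba.max(axis=1)
        # Add the most confidently predicted unlabeled documents to the pool.
        for i in np.argsort(-conf):
            if i not in labels and len(labels) < len(idx) + grow_per_round:
                labels[i] = int(proba[i].argmax())
    return clf, vec
```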

6.
Machine learning applications must continually utilize label information from the data stream to detect concept drift and adapt to dynamic behavior. Because label information is computationally expensive to obtain, it is impractical to assume that the data stream is fully labeled, and much research on semi-supervised concept drift detection has therefore been proposed. Despite the large research effort in the literature, there is a lack of analysis relating the information resources required to the achievable concept drift detection accuracy. Hence, this paper aims to answer the unexplored research question "How many labeled samples are required to detect concept drift accurately?" by proposing an analytical framework to analyze and estimate the information resources needed for accurate detection. Specifically, the paper disintegrates the distribution-based concept drift detection task into a learning task and a dissimilarity measurement task for independent analyses, and then correlates the analysis results to estimate the number of labels required within a set of data samples. The proximity of this estimation is evaluated empirically, and the results suggest that the estimation is accurate when a large amount of information resources is provided. Additionally, estimation results for a state-of-the-art method on a benchmark data set are reported to show the applicability of the proposed framework within benchmarked environments. In general, the estimation from the proposed analytical framework can serve as guidance in designing systems with limited information resources. This paper also hopes to assist in identifying research gaps and inspiring new research ideas regarding the analysis of the amount of information resources required for accurate concept drift detection.

7.
Dialectal Arabic (DA) refers to the varieties of everyday spoken language in the Arab world. These dialects differ according to the country and region of the speaker, and their textual content is constantly growing with the rise of social media networks and web blogs. Although research on Natural Language Processing (NLP) for standard Arabic, namely Modern Standard Arabic (MSA), has witnessed remarkable progress, research efforts on DA are rather limited. This is due to numerous challenges, such as the scarcity of labeled data as well as the nature and structure of DA. While some recent works have reached decent results on several DA sentence classification tasks, more complex tasks such as sequence labeling still suffer from weak performance on DA varieties with little labeled data or unlabeled data only. Moreover, zero-shot transfer learning from models trained on MSA has been shown to perform poorly on DA. In this paper, we introduce AdaSL, a new unsupervised domain adaptation framework for Arabic multi-dialectal sequence labeling that leverages unlabeled DA data, labeled MSA data, and existing multilingual and Arabic Pre-trained Language Models (PLMs). The proposed framework relies on four key components: (1) domain-adaptive fine-tuning of multilingual/MSA language models on unlabeled DA data, (2) sub-word embedding pooling, (3) iterative self-training on unlabeled DA data, and (4) iterative DA and MSA distribution alignment. We evaluate the framework on multi-dialectal Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The overall results show that zero-shot transfer learning with the proposed framework boosts the performance of multilingual PLMs by 40.87% in macro-F1 score on the NER task and by 6.95% in accuracy on the POS tagging task. For Arabic PLMs, the framework increases performance by 16.18% macro-F1 on NER and 2.22% accuracy on POS tagging, achieving new state-of-the-art zero-shot transfer learning performance for Arabic multi-dialectal sequence labeling.

8.
Making appropriate decisions is a key factor in helping companies face the challenges of today's supply chains. In this paper, we propose two data-driven approaches for better decision-making in supply chain management: a Long Short-Term Memory (LSTM) network-based method for forecasting multivariate time series data, and an LSTM autoencoder network combined with a one-class support vector machine for detecting anomalies in sales. Unlike other approaches, we recommend combining external and internal company data sources to enhance the performance of the forecasting algorithms, using a multivariate LSTM with optimal hyperparameters. In addition, we propose a method to optimize the hyperparameters of hybrid algorithms for detecting anomalies in time series data. The proposed approaches are applied to both benchmark datasets and real data from fashion retail. The results show that the LSTM autoencoder-based method outperforms the LSTM-based method suggested in a previous study for anomaly detection, and the proposed forecasting method for multivariate time series also outperforms several other methods on a dataset provided by NASA.
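The anomaly detection pipeline described above — an LSTM autoencoder that learns latent codes of normal sales sequences, followed by a one-class SVM over those codes — might look like the sketch below. The Keras/scikit-learn choices, latent dimension, and hyperparameters are assumptions, not the paper's configuration.

```python
# Minimal sketch: LSTM autoencoder features + one-class SVM detector.
from tensorflow import keras
from sklearn.svm import OneClassSVM

def build_lstm_autoencoder(timesteps, n_features, latent_dim=16):
    inputs = keras.Input(shape=(timesteps, n_features))
    z = keras.layers.LSTM(latent_dim)(inputs)               # encoder
    x = keras.layers.RepeatVector(timesteps)(z)
    x = keras.layers.LSTM(latent_dim, return_sequences=True)(x)
    outputs = keras.layers.TimeDistributed(
        keras.layers.Dense(n_features))(x)                  # decoder
    ae = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, z)
    ae.compile(optimizer="adam", loss="mse")
    return ae, encoder

def fit_detector(X_train):                # X_train: (n, timesteps, n_features)
    ae, enc = build_lstm_autoencoder(X_train.shape[1], X_train.shape[2])
    ae.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)
    ocsvm = OneClassSVM(nu=0.05).fit(enc.predict(X_train))
    return enc, ocsvm   # score new data with ocsvm.predict(enc.predict(X))
```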

9.
Ranking is a central component of information retrieval systems; as such, many machine learning methods for building rankers have been developed in recent years. An open problem is transfer learning, i.e., how labeled training data from one domain/market can be used to build rankers for another. We propose a flexible transfer learning strategy based on sample selection: source domain training samples are selected if the functional relationship between features and labels does not deviate much from that of the target domain. This is achieved through a novel application of recent advances in density ratio estimation. The approach is flexible, scalable, and modular, and allows many existing supervised rankers to be adapted to the transfer learning setting. Results on two datasets (Yahoo's Learning to Rank Challenge and Microsoft's LETOR data) show that the proposed method gives robust improvements.
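One common density ratio estimator uses a probabilistic classifier that discriminates source from target samples, as sketched below. This estimator is an assumption for illustration; the paper selects samples by the deviation of the feature-label relationship, which this covariate-level sketch does not capture.

```python
# Minimal sketch: probabilistic-classifier density ratio + sample selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_source, X_target):
    """Estimate w(x) = p_target(x) / p_source(x) for each source sample by
    training a classifier to distinguish the two domains."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_source)[:, 1]        # P(target | x)
    prior = len(X_source) / len(X_target)        # corrects class imbalance
    return prior * p / (1.0 - p + 1e-12)

def select_source_samples(X_source, X_target, keep_frac=0.7):
    """Keep the source samples that look most like the target domain."""
    w = density_ratio_weights(X_source, X_target)
    k = int(keep_frac * len(X_source))
    return np.argsort(-w)[:k]
```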

10.
The wide spread of false information has detrimental effects on society, and false information detection has received wide attention. When new domains appear, relevant labeled data is scarce, which poses severe challenges for detection. Previous work mainly leverages additional data or domain adaptation technology to assist detection. The former imposes a heavy data burden; the latter underutilizes the pre-trained language model, both because of the gap between the downstream task and the pre-training task and because it requires storing a separate set of parameters for each domain. To this end, we propose a meta-prompt based learning (MAP) framework for low-resource false information detection. We tap the potential of pre-trained language models by transforming detection tasks into pre-training tasks through template construction. To prevent randomly initialized templates from hindering performance, we learn good initial parameters by borrowing the fast parameter training of meta learning. Combining meta learning and prompt learning for detection is non-trivial: constructing meta tasks that yield initial parameters suitable for different domains, and setting up the prompt model's verbalizer for classification in the noisy low-resource scenario, are both challenging. For the former, we propose a multi-domain meta task construction method to learn domain-invariant meta knowledge. For the latter, we propose a prototype verbalizer to summarize category information and design a noise-resistant prototyping strategy to reduce the influence of noisy data. Extensive experiments on real-world data demonstrate the superiority of MAP in new domains of false information detection.

11.
Anomalous data deviate from the large number of normal data points and often have negative impacts on various systems. Anomaly detection is of great practical importance as an effective means to find anomalies in data and support the normal operation of various systems, but current detection technology suffers from low accuracy, high false alarm rates, and a lack of labeled data. In this paper, we propose an anomaly detection classification model, MGVN, that incorporates federated learning and Gaussian-mixture variational autoencoder networks. The MGVN model first constructs a variational autoencoder with a Gaussian-mixture prior to extract features from the input data, and then builds a deep support vector network on top of the autoencoder to compress the feature space. MGVN finds the minimal hypersphere separating normal from abnormal data and measures the anomaly score by the Euclidean distance between a data point's features and the hypersphere center. Federated learning is then incorporated into MGVN (FL-MGVN) so that multiple participants can collaboratively train a global model without sharing private data. Experiments on benchmark datasets such as NSL-KDD, MNIST and Fashion-MNIST demonstrate that the proposed FL-MGVN has higher recognition performance and classification accuracy than other methods, with average AUCs of 0.954 on MNIST and 0.937 on Fashion-MNIST.
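The hypersphere-based score described above can be sketched as below in the style of Deep SVDD; the mean-feature center initialization is a common convention and the encoder is a placeholder for the paper's Gaussian-mixture VAE encoder.

```python
# Minimal sketch: hypersphere center, Euclidean anomaly score, threshold.
import numpy as np

def fit_center(normal_features):
    """Approximate the minimal enclosing hypersphere's center by the mean
    of normal-data features (a common Deep SVDD-style initialization)."""
    return normal_features.mean(axis=0)

def anomaly_scores(features, center):
    # Euclidean distance between each feature vector and the sphere center.
    return np.linalg.norm(features - center, axis=1)

def detect(features, center, radius):
    return anomaly_scores(features, center) > radius   # True = anomalous
```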

12.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms have been proposed for transductive classification of text collections represented in a vector space model, but their practical use is hampered by the independence assumption among instances or terms and by other drawbacks. Network-based algorithms avoid these drawbacks and improve transductive classification. Networks are mostly used for label propagation, in which labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful for representing text collections and performing label propagation: generating this type of network requires neither hyperlinks or citations, nor the computation of similarities among all texts in the collection, nor the setup of many parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and connections are given by the occurrences of terms in documents; label propagation runs iteratively from documents to terms and then from terms to documents. In this article, instead of using terms merely as conduits for label propagation, we propose using the bipartite network structure to define term relevance scores for classes through an optimization process, and then propagating these relevance scores to label unlabeled documents. The new document labels in turn redefine the term relevance scores, which again redefine the labels of unlabeled documents, in an iterative process. We demonstrate that the proposed approach surpasses transductive classification algorithms based on the vector space model or on networks, that it effectively makes use of unlabeled documents to improve classification, and that it is faster than other transductive algorithms.
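A plain version of the iterative document-term propagation described above might look like the following sketch (dense NumPy for clarity; the paper's optimization-based term relevance scores are not reproduced).

```python
# Minimal sketch: label propagation on a bipartite document-term network.
import numpy as np

def propagate(W, Y, labeled_mask, n_iter=20):
    """W: (n_docs, n_terms) term-occurrence matrix; Y: (n_docs, n_classes)
    one-hot labels for labeled documents, zeros for unlabeled ones."""
    D2T = (W / (W.sum(axis=0, keepdims=True) + 1e-12)).T  # each term gathers docs
    T2D = W / (W.sum(axis=1, keepdims=True) + 1e-12)      # each doc gathers terms
    F = Y.copy()
    for _ in range(n_iter):
        F_terms = D2T @ F                   # documents -> terms
        F = T2D @ F_terms                   # terms -> documents
        F[labeled_mask] = Y[labeled_mask]   # clamp the known labels
    return F.argmax(axis=1)                 # predicted class per document
```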

13.
Text classification is an important research topic in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied to this task. However, in existing graph-based models, text graphs constructed by rules are not real graph data and introduce massive noise. More importantly, with a fixed corpus-level graph structure, these models cannot sufficiently exploit the labeled and unlabeled information of nodes. Meanwhile, contrastive learning has emerged in the graph domain as an effective method to fully utilize node information. We therefore propose a new graph-based model for text classification, CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. First, we explore word co-occurrence and document-word relationships to construct a text graph. Then, we design an adaptive augmentation strategy for the noisy text graph that generates two contrastive views which mitigate the noise problem while preserving essential structure. Specifically, we design noise-based and centrality-based augmentation strategies on the topological structure of the text graph to disturb unimportant connections and thus highlight the relatively important edges. For labeled nodes, we take nodes with the same label as multiple positive samples assigned to the anchor node, while for unlabeled nodes we employ consistency training to constrain model predictions. Finally, to reduce the resource consumption of contrastive learning, we adopt random sampling to select a subset of nodes for computing the contrastive loss. Experimental results on several benchmark datasets demonstrate the effectiveness of CGA2TC on the text classification task.
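A centrality-based augmentation of the kind described above might be sketched as follows: drop edges with a probability that decreases with their endpoints' centrality, so unimportant connections are disturbed more often. This is one plausible reading for illustration, not the paper's exact strategy.

```python
# Minimal sketch: centrality-weighted edge dropping for a contrastive view.
import random
import networkx as nx

def centrality_augment(G, p_max=0.5):
    deg = nx.degree_centrality(G)
    view = G.copy()
    max_c = max(deg.values()) or 1.0
    for u, v in list(view.edges()):
        importance = (deg[u] + deg[v]) / (2 * max_c)    # in [0, 1]
        if random.random() < p_max * (1 - importance):  # low-importance edges
            view.remove_edge(u, v)                      # are dropped more often
    return view

# Two such stochastic views of the same text graph form the contrastive pair.
```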

14.
Search task success rate is an important indicator of search engine performance. In contrast to most previous approaches, which rely on labeled search tasks provided by users or third-party editors, this paper attempts to improve search task success evaluation by exploiting the unlabeled search tasks that exist in search logs alongside a small amount of labeled ones. Concretely, we propose the Multi-view Active Semi-Supervised Search task Success Evaluation (MA4SE) approach, which exploits labeled and unlabeled data by integrating the advantages of semi-supervised learning and active learning with a multi-view mechanism. In the semi-supervised learning part of MA4SE, we employ a multi-view semi-supervised learning approach that utilizes different parameter configurations to achieve disagreement between base classifiers, which are trained separately on pre-defined action and time views. In the active learning part, each classifier obtained from semi-supervised learning is applied to the unlabeled search tasks, and the tasks that require manual annotation are selected based on both the degree of disagreement between base classifiers and a regional density measurement. We evaluate the proposed approach on open datasets under two different definitions of search task success. The experimental results show that MA4SE outperforms the state-of-the-art semi-supervised search task success evaluation approach.

15.
黄静  薛书田  肖进 《软科学》2017,(7):131-134
Combining semi-supervised learning with the multi-classifier ensemble model Bagging, this paper builds a semi-supervised ensemble model based on Bagging (SSEBI) for class-imbalanced settings, which exploits both labeled and unlabeled samples to improve model performance. The model comprises three stages: (1) selectively labeling a portion of the samples in the unlabeled dataset and training several base classifiers; (2) classifying the test-set samples with the trained base classifiers; and (3) ensembling the classification results into the final prediction. An empirical analysis on five customer credit-scoring datasets demonstrates the effectiveness of the proposed SSEBI model.

16.
17.
In synthetic aperture radar (SAR) image change detection, deep learning has attracted increasing attention because the difference images (DIs) of traditional unsupervised techniques are vulnerable to speckle noise. However, most existing deep networks do not constrain the distributional characteristics of the hidden space, which may limit their feature representation performance. This paper proposes a variational autoencoder (VAE) network with a siamese structure to detect changes in SAR images. The VAE encodes the input as a probability distribution in the hidden space, yielding regular hidden-layer features with good representation ability, and the subnetworks with shared parameters and structure extract spatially consistent features from the original images, which benefits the subsequent classification. The proposed method includes three main steps. First, training samples are selected based on pseudo-labels generated by a clustering algorithm. Then, the model is trained with a semi-supervised learning strategy comprising unsupervised feature learning and supervised network fine-tuning. Finally, the original data, rather than the DIs, are fed into the trained network to obtain the change detection results. Experimental results on four real SAR datasets show the effectiveness and robustness of the proposed method.

18.
To address the high speed, unbounded continuity, and dynamic uncertainty of data streams, this paper tackles anomaly identification in uncertain data streams from the perspective of improving the data management capability for such streams. First, wavelet analysis is used to separate the high-frequency and low-frequency components of the continuous stream traffic data; then, a clustering method for uncertain data streams is applied to locate the anomalous points. Simulation experiments show that the detection method adapts well to the uncertainty of data streams and achieves good detection results under certain conditions.
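The two-step idea might be sketched as below, with PyWavelets for the decomposition and K-Means as a stand-in for the paper's uncertain-stream clustering method; the wavelet family, level, and two-cluster setup are assumptions.

```python
# Minimal sketch: wavelet high/low-frequency split + clustering on detail energy.
import numpy as np
import pywt
from sklearn.cluster import KMeans

def wavelet_split(signal, wavelet="db4", level=3):
    """Separate the low-frequency trend from the high-frequency detail."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return coeffs[0], np.concatenate(coeffs[1:])  # (approximation, details)

def flag_anomalies(signal):
    _, details = wavelet_split(signal)
    X = np.abs(details).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    # The cluster with the larger centroid carries the larger high-frequency
    # energy; its members are flagged as anomalous. Note the returned mask
    # indexes detail coefficients, not raw time points -- a simplification.
    anomalous_cluster = int(np.argmax(km.cluster_centers_))
    return km.labels_ == anomalous_cluster
```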

19.
Most existing large-scale, high-dimensional streaming anomaly detection methods suffer from extremely high time and space complexity. Moreover, they are very sensitive to parameters, which lowers their generalization ability and restricts them to a few specific application scenarios. This paper proposes a three-layer high-dimensional streaming anomaly detection model, the double locality-sensitive hashing Bloom filter (dLSHBF). We first build the first two layers, a double locality-sensitive hashing (dLSH) scheme, proving that dLSH reduces the hash coding length of the data while ensuring that the projected data retain a favorable distance-preserving property. We then use a Bloom filter as the third layer of the dLSHBF model to improve the efficiency of anomaly detection. Six large-scale high-dimensional data stream datasets from different IIoT anomaly detection domains were selected for comparison experiments. First, extensive experiments show that the distance-preserving performance of the proposed dLSH algorithm is significantly better than that of existing LSH algorithms. Second, we verify that the dLSHBF model is more efficient than other advanced Bloom filter models (e.g., the Robust Bloom Filter, Fly Bloom Filter, Sandwich Learned Bloom Filter, and Adaptive Learned Bloom Filter). Compared with the state of the art, dLSHBF achieves an anomaly detection rate (DR) above 97% and a false alarm rate (FAR) below 2.2%, and its effectiveness and generalization ability outperform other existing streaming anomaly detection methods.
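The layered idea — an LSH stage that compresses high-dimensional points into short signatures, feeding a Bloom filter of known-normal signatures — can be sketched as follows. The random-projection LSH and parameter choices are illustrative assumptions; the paper's dLSH construction is not reproduced.

```python
# Minimal sketch: random-projection LSH signatures + Bloom filter of normals.
import hashlib
import numpy as np

class LSHBloom:
    def __init__(self, dim, n_bits=16, m=1 << 20, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # LSH hyperplanes
        self.bits = np.zeros(m, dtype=bool)               # Bloom bit array
        self.m, self.k = m, k

    def _signature(self, x):
        return (self.planes @ x > 0).tobytes()   # short binary LSH code

    def _positions(self, sig):
        return [int.from_bytes(hashlib.sha256(sig + bytes([i])).digest()[:8],
                               "big") % self.m for i in range(self.k)]

    def add_normal(self, x):                      # index a known-normal sample
        for pos in self._positions(self._signature(x)):
            self.bits[pos] = True

    def is_anomalous(self, x):
        # Absent from the filter => this LSH bucket was never seen as normal.
        return not all(self.bits[p] for p in self._positions(self._signature(x)))
```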

20.
Few-Shot Event Classification (FSEC) aims to assign event labels to unlabeled sentences when only limited annotated samples are available. Existing works mainly use meta-learning to overcome the low-resource problem, which still requires abundant held-out classes for model learning and selection; we instead propose to deal with the low-resource problem by utilizing prompts. Further, existing methods suffer from severe trigger biases that can lead the model to ignore the context: correct classifications are obtained by looking only at the triggers, which hurts the model's generalization ability. We therefore propose a knowledgeable augmented-trigger prompt FSEC framework (AugPrompt), which overcomes the bias issues and alleviates the classification bottleneck caused by insufficient data. In detail, we first design an External Knowledge Injection (EKI) module that incorporates an external knowledge base (Related Words) for trigger augmentation. Then, we propose an Event Prompt Generation (EPG) module to generate appropriate discrete prompts for initializing the continuous prompts. After that, we propose an Event Prompt Tuning (EPT) module to automatically search prompts in the continuous space for FSEC and finally predict the corresponding event types of the inputs. We conduct extensive experiments on two public English FSEC datasets, FewEvent and RAMS. The experimental results show the superiority of our proposal over competitive baselines, with a maximum accuracy increase of 10.8% over the strongest baseline.
