Similar literature
Found 20 similar articles; search took 234 ms
1.
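Lin Ping (林萍), Lü Jianchao (吕健超). Information Science (《情报科学》), 2023, 41(2): 135-142
[Purpose/Significance] This paper proposes a strategy for identifying question-and-answer (Q&A) information adoption behavior based on Stacking ensemble learning, in order to promote precise pushing of Q&A content in online health communities and to support the high-quality development of digital medical services. [Method/Process] A Stacking ensemble learning model is built with ensemble and non-ensemble learning methods as base learners and logistic regression (LR) as the meta-learner. The prediction accuracy of single prediction models is compared with that of Stacking models that combine prediction models of the same type and of different types. Chronic disease Q&A data from the "寻医问药" platform are used to build a dataset that verifies the model's superiority, and data from the "快速问医生有问必答120" platform are used to verify its portability. [Result/Conclusion] Compared with single prediction models, the Stacking ensemble model identifies adopted Q&A information more accurately, and it generalizes well enough to be applied to different online health communities. [Innovation/Limitation] Based on the Stacking ensemble idea, this paper builds a two-stage prediction model and uses machine learning to construct the best combination of prediction models, significantly improving the accuracy of identifying adopted Q&A information in online health communities. However, as Q&A information accumulates and the Q&A patterns of online health communities keep evolving, dynamic prediction methods that combine historical data with daily updated data are a priority for future research.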

2.
Compared with continuous temporal distributions, discrete data representations may be desirable for simpler and faster data analysis and forecasting. Data compression offers an efficient way to reduce continuous historical stock market data and present them in discrete form. When predicting stock trends, the primary concern is the up or down direction of the price movement, so data discretization can provide a beneficial, focused approach. In this article, we propose a quantization-based data fusion approach whose primary motivation is to reduce data complexity and hence enhance the prediction ability of a model. Here, the continuous time-series values are transformed into discrete quantum values before being fed to a prediction model. We extend the proposed approach and factorize quantization by integrating different quantization step sizes. Such fused data reduce the data so that they mainly concentrate on the stock price movement direction. To empirically evaluate the proposed approach for stock trend prediction, we adopt long short-term memory, deep neural network, and backpropagation neural network models and compare our prediction results with five existing approaches on several datasets using ten performance metrics. We analyze the impact of specific quantization factors and determine the individual best as well as overall best factor sizes; the results indicate a consistent performance enhancement in stock trend prediction accuracy compared with the considered baseline methods, with an improvement of up to 7%. To evaluate the impact of quantization-based data fusion, we analyze the time required to execute the experiments along with the percentage reduction in the number of unique numeric terms. Further, these results are statistically evaluated using the Wilcoxon signed-rank test. We discuss the superiority and applicability of the factored quantization-based data fusion approach and conclude our work with potential future research directions.
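
As a rough illustration of the quantization step this abstract describes, the sketch below maps continuous prices to discrete levels with a single step size and then reads off up/down labels. The function names, the round-to-nearest rule, the step size, and the sample prices are all assumptions made for illustration; the abstract does not give the paper's exact quantization or fusion scheme, and the fusion of several step sizes is not shown.

```python
def quantize(values, step):
    """Map each value to the nearest multiple of `step` (an assumed rounding rule)."""
    return [round(v / step) * step for v in values]


def trend_labels(values):
    """Label each consecutive move as +1 (up), -1 (down), or 0 (flat)."""
    labels = []
    for prev, curr in zip(values, values[1:]):
        labels.append((curr > prev) - (curr < prev))
    return labels


def unique_reduction(original, quantized):
    """Percentage reduction in the number of unique numeric values."""
    before, after = len(set(original)), len(set(quantized))
    return 100.0 * (before - after) / before


# Hypothetical closing prices, not taken from the paper.
prices = [101.2, 101.9, 103.4, 102.6, 102.7, 105.1]
coarse = quantize(prices, step=2)

print(coarse)
print(trend_labels(coarse))
print(unique_reduction(prices, coarse))
```

A larger step removes more small fluctuations but also flattens genuine moves, which is presumably why the abstract studies the effect of different quantization factor sizes.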

3.
Textual entailment is a task for which the application of supervised learning mechanisms has received considerable attention, driven by successive Recognizing Textual Entailment data challenges. We developed a linguistic analysis framework in which a number of similarity/dissimilarity features are extracted for each entailment pair in a data set and various classifier methods are evaluated based on the instance data derived from the extracted features. The focus of the paper is to compare and contrast the performance of single and ensemble-based learning algorithms for a number of data sets. We showed that there is some benefit to the use of ensemble approaches but, based on the extracted features, Naïve Bayes proved to be the strongest learning mechanism. Only one ensemble approach demonstrated a slight improvement over Naïve Bayes.

4.
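A new large-sample ensemble forecast system is described in detail. To reduce forecast uncertainty in ENSO (El Niño-Southern Oscillation) prediction, the system is built on an intermediate-complexity coupled model and uses ensemble Kalman filter data assimilation to assimilate available ocean observations, which provides the ensemble initial fields. In addition, a first-order linear Markov stochastic error model developed for 12-month forecasts is embedded in the system to simulate model uncertainty. Based on ensemble hindcast experiments with 100 samples for November 1992 to October 2008, the forecast skill of the system is verified in terms of both deterministic and probabilistic forecast skill. The ensemble forecast method effectively extends traditional deterministic forecasting into the domain of probabilistic forecasting, and the verification shows that the ensemble mean outperforms a single deterministic forecast. For probabilistic forecasts, the ensemble samples track the observed variations well and provide additional information that a purely deterministic forecast cannot.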

5.
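A project performance prediction method based on ensemble learning is proposed. It uses a multi-class supervised ensemble learning algorithm to mine the information about project performance that is implicit in data on completed projects collected by a web crawler, and from this forms a project performance prediction model. Using data on National Natural Science Foundation of China projects, the model's performance is evaluated with multiple metrics, and its performance predictions are compared with experts' evaluations; the results demonstrate the model's effectiveness.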

6.
Object matching is an important task for finding the correspondence between objects in different domains, such as documents in different languages and users in different databases. In this paper, we propose probabilistic latent variable models that offer many-to-many matching without correspondence information or similarity measures between different domains. The proposed model assumes that there is an infinite number of latent vectors that are shared by all domains, and that each object is generated from one of the latent vectors and a domain-specific projection. By inferring the latent vector used for generating each object, objects in different domains are clustered according to the vectors that they share. Thus, we can realize matching between groups of objects in different domains in an unsupervised manner. We give learning procedures for the proposed model based on a stochastic EM algorithm. We also derive learning procedures in a semi-supervised setting, where correspondence information for some objects is given. The effectiveness of the proposed models is demonstrated by experiments on synthetic and real data sets.

7.
Deep forest     
Current deep-learning models are mostly built upon neural networks, i.e. multiple layers of parameterized differentiable non-linear modules that can be trained by backpropagation. In this paper, we explore the possibility of building deep models based on non-differentiable modules such as decision trees. After a discussion about the mystery behind deep neural networks, particularly by contrasting them with shallow neural networks and traditional machine-learning techniques such as decision trees and boosting machines, we conjecture that the success of deep neural networks owes much to three characteristics, i.e. layer-by-layer processing, in-model feature transformation and sufficient model complexity. On one hand, our conjecture may offer inspiration for theoretical understanding of deep learning; on the other hand, to verify the conjecture, we propose an approach that generates deep forest holding these characteristics. This is a decision-tree ensemble approach, with fewer hyper-parameters than deep neural networks, and its model complexity can be automatically determined in a data-dependent way. Experiments show that its performance is quite robust to hyper-parameter settings, such that in most cases, even across different data from different domains, it is able to achieve excellent performance by using the same default setting. This study opens the door to deep learning based on non-differentiable modules without gradient-based adjustment, and exhibits the possibility of constructing deep models without backpropagation.

8.
In text categorization, the number of documents often differs across categories, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples of rare classes (categories with a relatively small amount of training data) by using global semantic information of classes represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved using this transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes. Such re-sampling methods can cause overfitting. Another benefit of our approach is the effective handling of noisy samples. Since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed method achieves better performance under class imbalance and is more tolerant of noisy samples.

9.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise due to the difficulties of relevance judgments and several other reasons. To tackle the problem, in this paper we propose a novel approach to learning to rank, based on a probabilistic graphical model. Considering that the observed label might be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true labels and observed labels given features of documents. The graphical model distinguishes the true labels from observed labels, and is specially designed for ranking in information retrieval. Therefore, it helps to learn a more accurate model from noisy training data. Experiments on a real dataset for web search show that the proposed approach can significantly outperform previous approaches.

10.
Clustering is a basic technique in information processing. Traditional clustering methods, however, are not suitable for high-dimensional data. Thus, learning a subspace for clustering has emerged as an important research direction. Nevertheless, meaningful data often lie on a low-dimensional manifold, and existing subspace learning approaches cannot fully capture the nonlinear structures of the hidden manifold. In this paper, we propose a novel subspace learning method that not only characterizes the linear and nonlinear structures of data, but also reflects the requirements of the subsequent clustering. Compared with other related approaches, the proposed method can derive a subspace that is more suitable for high-dimensional data clustering. Promising experimental results on different kinds of data sets demonstrate the effectiveness of the proposed approach.

11.
Imbalanced sample distribution is usually the main reason for the performance degradation of machine learning algorithms. Based on this, this study proposes a hybrid framework (RGAN-EL) that combines generative adversarial networks with an ensemble learning method to improve the classification performance on imbalanced data. Firstly, we propose a training sample selection strategy based on the roulette wheel selection method to make the GAN pay more attention to the class overlapping area when fitting the sample distribution. Secondly, we design two kinds of generator training loss and propose a noise sample filtering method to improve the quality of generated samples. Then, minority class samples are oversampled using the improved RGAN to obtain a balanced training sample set. Finally, combined with the ensemble learning strategy, the final training and prediction are carried out. We conducted experiments on 41 real imbalanced data sets using two evaluation indexes: F1-score and AUC. Specifically, we compare RGAN-EL with six typical ensemble learning methods, and RGAN with three typical GAN models. The experimental results show that RGAN-EL is significantly better than the other six ensemble learning methods, and that RGAN is greatly improved compared with the three classical GAN models.
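
The roulette wheel selection mentioned in this abstract can be sketched generically as sampling indices with probability proportional to a weight. The `overlap_scores` values below are hypothetical stand-ins for whatever per-sample weight the authors use to favor the class overlapping area; the abstract does not define that weight, so this is only an illustration of the selection mechanism itself.

```python
import random


def roulette_wheel_select(weights, k, rng):
    """Draw k indices, each with probability proportional to its non-negative weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    # Cumulative sums mark where each index's slice of the wheel ends.
    cumulative = []
    running = 0.0
    for w in weights:
        running += w
        cumulative.append(running)

    picks = []
    for _ in range(k):
        r = rng.random() * total  # spin the wheel: a point in [0, total)
        for idx, bound in enumerate(cumulative):
            if r < bound:
                picks.append(idx)
                break
        else:
            # Only reachable in a rare float rounding edge case where r == total.
            picks.append(len(weights) - 1)
    return picks


# Hypothetical scores: samples 1 and 2 lie in the overlap region, 0 and 3 do not.
overlap_scores = [0.0, 1.0, 3.0, 0.0]
picks = roulette_wheel_select(overlap_scores, k=1000, rng=random.Random(0))
print(picks.count(1), picks.count(2))
```

Samples with zero weight are never drawn, and a sample with three times the weight is drawn roughly three times as often, which is how such a scheme can focus training on selected regions.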

12.
Interdocument similarities are the fundamental information source required in cluster-based retrieval, which is an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, which was introduced by Tombros and van Rijsbergen as a method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-specific similarity are still limited to vector space models wherein there is no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models. We then formulate all the requisite components for the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees’ nearest neighbor test (NNT).

13.
Health monitoring of nonlinear systems is broadly concerned with tracking the system health and predicting it over future time horizons. Estimation and prediction schemes constitute principal components of any health monitoring technique. The particle filter (PF) is a powerful tool for performing state and parameter estimation as well as prediction of nonlinear dynamical systems. Estimation of the system parameters along with the states can yield an up-to-date and reliable model that can be used for long-term prediction problems through utilization of particle filters. This feature enables one to deal with uncertainty issues in the resulting prediction step as the time horizon is extended. Towards this end, this paper presents an improved method to achieve uncertainty management for long-term prediction of nonlinear systems by using particle filters. In our proposed approach, an observation forecasting scheme is developed to extend the system observation profiles (as time-series) to a future time horizon. Particles are then propagated to future time instants according to a resampling algorithm instead of considering constant weights for particle propagation in the prediction step. The uncertainty in the long-term prediction of the system states and parameters is managed by utilizing dynamic linear models for development of an observation forecasting scheme. This task is addressed through an outer adjustment loop for adaptively changing the sliding observation injection window based on the Mahalanobis distance criterion. Our proposed approach is then applied to predicting the health condition as well as the remaining useful life (RUL) of a gas turbine engine that is affected by degradations in the system health parameters. Extensive simulation and case studies are conducted to demonstrate and illustrate the capabilities and performance characteristics of our proposed and developed schemes.
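
For readers unfamiliar with the Mahalanobis distance criterion named in this abstract, the sketch below computes the standard distance d = sqrt((x - mean)^T cov^{-1} (x - mean)) for the two-dimensional case. The function name and the sample numbers are assumptions for illustration; how the authors turn this distance into a window adjustment rule is not stated in the abstract and is not reproduced here.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution with a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")

    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    qx = (d * dx - b * dy) / det
    qy = (-c * dx + a * dy) / det
    return math.sqrt(dx * qx + dy * qy)


# Hypothetical forecast error (2.0, 1.0) with variances 4.0 and 1.0 and no correlation.
distance = mahalanobis_2d((2.0, 1.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]])
print(distance)
```

Unlike the Euclidean distance, this measure scales each direction by its variance, so a deviation of 2.0 along a high-variance axis counts for no more than a deviation of 1.0 along a low-variance one.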

14.
With the increasing rate of adoption and growth of cloud computing services, businesses have been shifting their information technology (IT) infrastructure to the cloud. Although cloud vendors promise high availability and reliability of their cloud services, cloud-related incidents involving outages and service disruptions remain a challenge. Understanding cloud incidents and the ability to predict them would be helpful in deciding how to manage and circumvent future incidents. In this study, we propose a hybrid model that employs machine learning and time series methods to forecast cloud incidents. We evaluate the proposed model using a sample of 2261 incidents collected from two cloud providers, namely Netflix and Hulu. Unique to this study is that our model relies solely on historical data that is independent of the underlying cloud infrastructure. Results suggest that the proposed hybrid model outperforms individual forecasting models: neural network, time series and random forest. Results also reveal important temporal insights from the proposed model and highlight the practical relevance of historical data to forecast and manage cloud incidents.

15.
Search task success rate is an important indicator of search engine performance. In contrast to most previous approaches, which rely on labeled search tasks provided by users or third-party editors, this paper attempts to improve the performance of search task success evaluation by exploiting unlabeled search tasks that exist in search logs as well as a small number of labeled ones. Concretely, the Multi-view Active Semi-Supervised Search task Success Evaluation (MA4SE) approach is proposed, which exploits labeled data and unlabeled data by integrating the advantages of both semi-supervised learning and active learning with the multi-view mechanism. In the semi-supervised learning part of MA4SE, we employ a multi-view semi-supervised learning approach that utilizes different parameter configurations to achieve disagreement between base classifiers. The base classifiers are trained separately from the pre-defined action and time views. In the active learning part of MA4SE, each classifier received from semi-supervised learning is applied to unlabeled search tasks, and the search tasks that need to be manually annotated are selected based on both the degree of disagreement between base classifiers and a regional density measurement. We evaluate the proposed approach on open datasets with two different definitions of search task success. The experimental results show that MA4SE outperforms the state-of-the-art semi-supervised search task success evaluation approach.

16.
This paper presents a classifier for text data samples consisting of main text and additional components, such as Web pages and technical papers. We focus on multiclass and single-labeled text classification problems and design the classifier based on a hybrid composed of probabilistic generative and discriminative approaches. Our formulation considers individual component generative models and constructs the classifier by combining these trained models based on the maximum entropy principle. We use naive Bayes models as the component generative models for the main text and additional components such as titles, links, and authors, so that we can apply our formulation to document and Web page classification problems. Our experimental results for four test collections confirmed that our hybrid approach effectively combined main text and additional components and thus improved classification performance.

17.
Detection at an early stage is vital for the diagnosis of the majority of critical illnesses, and the same holds for identifying people suffering from depression. A number of studies have successfully identified depressed persons based on their social media postings. However, an unexpected bias has been observed in these studies, which can be due to various factors such as unequal data distribution. In this paper, the imbalance found in terms of participation across age groups and demographics is normalized using the one-shot decision approach. Further, we present an ensemble model combining SVM and KNN with intrinsic explainability in conjunction with noisy label correction approaches, offering an innovative solution to the problem of distinguishing between depression symptoms and suicidal ideation. We achieved a final classification accuracy of 98.05%, with the proposed ensemble model ensuring that the data classification is not biased in any manner.

18.
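Zhao Xuehua (赵雪花), Chen Xu (陈旭). Resources Science (《资源科学》), 2015, 37(6): 1173-1180
To address the non-stationarity of runoff time series and the low accuracy of medium- and long-term prediction, this paper proposes a new coupled prediction method: a Mean Generating Function-Optimum Subset Regression (MGF-OSR) model based on EMD decomposition. First, Empirical Mode Decomposition (EMD) is used to make stationary the annual runoff series of four hydrological stations on the upper Fenhe River (Shangjingyou, Fenhe Reservoir, Zhaishang, and Lancun), yielding several Intrinsic Mode Functions (IMFs) for each station. An MGF-OSR model is built for each IMF and used for prediction, while the trend term is predicted by straight-line fitting. The predicted values are then reconstructed to obtain the annual runoff predictions for the four stations, which are compared with the predictions of the MGF-OSR model used alone. The results show that the EMD-based MGF-OSR model predicts the annual runoff of the four stations with an accuracy of 100% at every station and deterministic coefficients above 0.975, whereas the single model reaches an accuracy of only 40% at every station with deterministic coefficients below 0.732; the coupled model clearly improves prediction accuracy.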
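
The mean generating function (MGF) named in this abstract can be sketched under one common formulation: for a period l, the i-th MGF value is the mean of the series values spaced l steps apart starting at position i, and the resulting l values are repeated periodically to form a candidate predictor. This formulation, the function names, and the toy series are assumptions for illustration; the abstract does not spell out the authors' exact variant, and the EMD and optimum subset regression steps are not shown.

```python
def mean_generating_function(series, period):
    """Return the `period` MGF values: means of series points spaced `period` apart."""
    n_l = len(series) // period  # number of complete cycles of this period
    if n_l == 0:
        raise ValueError("period is longer than the series")
    return [
        sum(series[i + j * period] for j in range(n_l)) / n_l
        for i in range(period)
    ]


def periodic_extension(mgf, length):
    """Repeat the MGF values cyclically to cover `length` time steps."""
    return [mgf[t % len(mgf)] for t in range(length)]


# Toy series, not runoff data from the paper.
series = [1, 2, 3, 4, 5, 6]
mgf_2 = mean_generating_function(series, period=2)
extended = periodic_extension(mgf_2, length=8)

print(mgf_2)
print(extended)
```

Extending the cycle beyond the length of the observed series is what lets such periodic sequences serve as predictors for future time steps in a regression.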

19.
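To meet the power system's need for accurate short-term load forecasting, a machine-learning-based hybrid algorithm (SaDE-LSTM) is proposed that takes the long short-term memory (LSTM) algorithm as its basis and improves it with a self-adaptive differential evolution algorithm. The hybrid algorithm is tested on monthly social electricity load data for China from 2004 to 2018. First, the self-adaptive mutation and crossover factors of the differential evolution algorithm are used to optimize the initial parameters of the LSTM; the LSTM is then trained with the parameters found by this search to produce the optimized forecast. To demonstrate its advantage, the same data are also forecast with support vector machine (SVM), backpropagation neural network, and autoregressive integrated moving average algorithms. A comparison of each method's forecasts with the true values shows that SaDE-LSTM requires relatively little time-series data while achieving higher forecast accuracy than the other traditional algorithms. The improved algorithm can serve as a reference for entities that participate in power system dispatch and need high-accuracy forecasts from small samples, such as virtual power plants and load aggregators.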

20.
Research on automated social media rumour verification, the task of identifying the veracity of questionable information circulating on social media, has yielded neural models achieving high performance, with accuracy scores that often exceed 90%. However, none of these studies focus on the real-world generalisability of the proposed approaches, that is whether the models perform well on datasets other than those on which they were initially trained and tested. In this work we aim to fill this gap by assessing the generalisability of top performing neural rumour verification models covering a range of different architectures from the perspectives of both topic and temporal robustness. For a more complete evaluation of generalisability, we collect and release COVID-RV, a novel dataset of Twitter conversations revolving around COVID-19 rumours. Unlike other existing COVID-19 datasets, our COVID-RV contains conversations around rumours that follow the format of prominent rumour verification benchmarks, while being different from them in terms of topic and time scale, thus allowing better assessment of the temporal robustness of the models. We evaluate model performance on COVID-RV and three popular rumour verification datasets to understand limitations and advantages of different model architectures, training datasets and evaluation scenarios. We find a dramatic drop in performance when testing models on a different dataset from that used for training. Further, we evaluate the ability of models to generalise in a few-shot learning setup, as well as when word embeddings are updated with the vocabulary of a new, unseen rumour. Drawing upon our experiments we discuss challenges and make recommendations for future research directions in addressing this important problem.
