Similar Articles
 20 similar articles found.
1.
The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data for analysis. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real-time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample data set with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness in the sample. We conclude that this approach yields a data set that is suitable for a wide range of post-hoc analyses, while remaining cost effective and accessible to a wide range of researchers.
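A sample's representativeness against the full feed can be checked, as the abstract describes, by comparing keyword distributions. A minimal sketch with made-up keyword counts (the `cosine_similarity` helper and the counts are illustrative, not the paper's actual procedure):

```python
from collections import Counter
from math import sqrt

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two term-frequency Counters."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[t] * freq_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in freq_a.values()))
    norm_b = sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical keyword counts: full feed vs. a 10% population sample.
full_feed = Counter({"earthquake": 120, "weather": 80, "football": 40})
sample = Counter({"earthquake": 12, "weather": 9, "football": 4})

similarity = cosine_similarity(full_feed, sample)
```

A similarity near 1.0 would indicate that the sample's keyword distribution closely tracks the full feed's.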

2.
Research Policy, 2022, 51(7): 104558
Clean energy technologies are important for meeting long-term climate and competitiveness goals. But clean energy industries are part of global value chains (GVCs), where past manufacturing shifts from developed to emerging economies have raised questions on a decline in long-term innovation. Our research centers on how geographic shifts in the GVC shape long-term innovation, i.e., innovation in a time frame within which “mission-oriented”, societal, or firm strategic objectives need to be met rather than tactical, near-term market competitiveness alone. Focusing on wind energy, we introduce a temporal measure to distinguish between long-term and short-term innovation, applying natural language processing methods on patent text data. We consider supply-side value chain factors (i.e., manufacturing supplier relationships with original equipment manufacturers (OEMs)) and demand-side factors (i.e., policy-induced clean energy market growth), shaping the patenting activities of 358 global specialized wind suppliers (2006–2016). Our findings suggest that the wind industry did not suppress long-term innovation during manufacturing shifts, in this case to China. After 2012 when China developed a large wind market, long-term innovation increased by 80.7% in European suppliers working with non-European OEMs (including Chinese) and by 67.2% in Chinese suppliers working with non-Chinese OEMs. Our results highlight the importance of coupling international manufacturing relationships with sizeable local demand for inducing long-term innovation. Our results advance research in innovation, GVCs, and green industrial policy with implications for several industries that can contribute to climate mitigation.

3.
Financial decisions are often based on classification models which are used to assign a set of observations into predefined groups. Various data classification models have been developed to foresee the financial crisis of an organization from its historical data. One important step towards the development of an accurate financial crisis prediction (FCP) model is the selection of appropriate variables (features) relevant to the problem at hand. This is termed the feature selection problem, and it helps to improve classification performance. This paper proposes an Ant Colony Optimization (ACO) based financial crisis prediction model which incorporates two phases: an ACO-based feature selection (ACO-FS) algorithm and an ACO-based data classification (ACO-DC) algorithm. The proposed ACO-FCP model is validated on a set of five benchmark datasets covering both qualitative and quantitative data. For feature selection, the developed ACO-FS method is compared with three existing feature selection algorithms, namely the genetic algorithm (GA), Particle Swarm Optimization (PSO), and Grey Wolf Optimization (GWO). In addition, classification results are compared between ACO-DC and state-of-the-art methods. Experimental analysis shows that the ACO-FCP ensemble model is superior and more robust than its counterparts. In consequence, this study concludes that the proposed ACO-FCP model is highly competitive with traditional and other artificial intelligence techniques.
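The ACO-FS phase can be pictured as a wrapper search in which ants sample feature subsets guided by pheromone trails. A toy sketch under invented settings (the scoring function, pheromone-update rule, and parameters are illustrative, not the paper's):

```python
import random

def aco_feature_select(n_features, score_fn, n_ants=20, n_iters=30,
                       evap=0.2, seed=0):
    """Toy ant-colony feature selection: each ant includes feature j
    with probability pher[j] / (pher[j] + 1); the best subset found
    so far deposits pheromone, and all trails evaporate each round."""
    rng = random.Random(seed)
    pher = [1.0] * n_features
    best_subset, best_score = [], float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            subset = [j for j in range(n_features)
                      if rng.random() < pher[j] / (pher[j] + 1.0)]
            if not subset:
                continue
            s = score_fn(subset)
            if s > best_score:
                best_score, best_subset = s, subset
        pher = [(1 - evap) * p for p in pher]   # evaporation
        for j in best_subset:                   # reinforce the incumbent
            pher[j] += 1.0
    return best_subset, best_score

# Hypothetical wrapper objective: features 0 and 2 are informative, and
# every selected feature costs a small penalty (a stand-in for a real
# classifier's cross-validated accuracy).
def score(subset):
    return sum(1.0 for j in subset if j in (0, 2)) - 0.1 * len(subset)

subset, _ = aco_feature_select(10, score)
```

After a few iterations the pheromone concentrates on the informative features, so the returned subset contains features 0 and 2.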

4.
This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. The analysis then concentrates on the types of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is a taxonomy of measure invariance with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Several text classification case studies supplement the discussion.
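The notion of measure invariance can be illustrated directly on binary confusion-matrix counts: precision and recall ignore changes confined to the true negatives, while accuracy does not. A minimal sketch (the counts are made up):

```python
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

base = dict(tp=30, fp=10, fn=5, tn=55)
shifted = dict(base, tn=550)   # only the true-negative count changes

# Precision and recall are invariant to a TN change; accuracy is not.
assert precision(**base) == precision(**shifted)
assert recall(**base) == recall(**shifted)
assert accuracy(**base) != accuracy(**shifted)
```

This is exactly the kind of invariance the taxonomy catalogues: which label-distribution changes leave each measure, and hence the classifier's ranking, untouched.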

5.
Text classification is an effective approach to processing text at large scale. Building on support vector machines and genetic algorithms, this paper proposes a new heuristic hierarchical text classification algorithm. Experimental results demonstrate the feasibility and effectiveness of the algorithm.

6.
This paper proposes an ensemble-learning approach to predicting project performance. Multi-class ensemble supervised learning algorithms are used to mine the performance-related information hidden in data on completed projects collected by a web crawler, yielding a project performance prediction model. Using data on National Natural Science Foundation of China projects, the model is evaluated on multiple metrics and its predictions are compared with expert assessments; the results demonstrate the model's effectiveness.

7.
High-resolution probabilistic load forecasting can comprehensively characterize both the uncertainties and the dynamic trends of future load. Such information is key to the reliable operation of a future power grid with a high penetration of renewables. To this end, various high-resolution probabilistic load forecasting models have been proposed in recent decades. It is widely acknowledged that combining different models can further enhance prediction performance over any single model, which is called model ensemble. However, existing model ensemble approaches for load forecasting are linear combination-based (e.g., mean-value ensembles, weighted-average ensembles, and quantile regression), and linear combinations may not fully exploit the advantages of the different models, seriously limiting ensemble performance. We propose a learning ensemble approach that adopts a machine learning model to directly learn the optimal nonlinear combination from data. We theoretically demonstrate that the proposed learning ensemble approach can outperform conventional ensemble approaches. Based on the proposed learning ensemble model, we also introduce a Shapley value-based method to evaluate the contribution of each model to the ensemble. Numerical studies on field load data verify the remarkable performance of our proposed approach.
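The gap between linear and nonlinear combination can be seen on a toy example: when each base forecaster is accurate in a different load regime, a learned gating rule beats every fixed weighted average. A sketch with synthetic numbers (the threshold gate stands in for the paper's machine-learned combiner):

```python
def mae(pred, true):
    """Mean absolute error of a forecast against the actual load."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

# Synthetic load series: model A is accurate at low load, model B at high.
true = [10, 12, 11, 50, 55, 52, 9, 53]
pred_a = [10.1, 11.8, 11.2, 45.0, 49.0, 47.0, 9.1, 48.0]
pred_b = [13.0, 15.0, 14.0, 50.2, 54.8, 52.1, 12.0, 53.3]

# Best fixed linear blend, found by grid search over the weight on A.
best_linear = min(
    mae([w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)], true)
    for w in [i / 20 for i in range(21)]
)

# Simple learned nonlinear rule: trust A below a level, B above it.
gated = [a if a < 30 else b for a, b in zip(pred_a, pred_b)]

assert mae(gated, true) < best_linear
```

No single weight can serve both regimes at once, which is the intuition behind learning the combination nonlinearly from data.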

8.
Huang Jing, Xue Shutian, Xiao Jin. Soft Science (《软科学》), 2017(7): 131-134
Combining semi-supervised learning with the Bagging multi-classifier ensemble model, this paper builds a semi-supervised ensemble model based on Bagging for class-imbalanced settings (SSEBI), which exploits both labeled and unlabeled samples to improve performance. The model comprises three stages: (1) selectively label a portion of the samples from the unlabeled dataset and train several base classifiers; (2) classify the test samples with the trained base classifiers; (3) aggregate the classification results into a final prediction. An empirical analysis on five customer credit-scoring datasets demonstrates the effectiveness of the proposed SSEBI model.
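The bagging stage of such a model can be sketched with a toy one-dimensional base learner (the threshold classifier, data, and labels below are illustrative, not the SSEBI implementation):

```python
import random
from statistics import mean

def train_threshold(sample):
    """Tiny base learner: threshold at the midpoint of the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    t = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x >= t else 0

def bagging_ensemble(labeled, n_models=5, seed=0):
    """Train base classifiers on bootstrap samples, predict by vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.choice(labeled) for _ in labeled]
        # guard: a bootstrap sample may miss one class entirely
        if not any(y for _, y in boot) or all(y for _, y in boot):
            boot = labeled
        models.append(train_threshold(boot))
    def predict(x):
        votes = sum(m(x) for m in models)     # majority vote
        return 1 if votes * 2 >= len(models) else 0
    return predict

# Hypothetical credit scores: label 1 ~ good credit (higher scores).
labeled = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
clf = bagging_ensemble(labeled)
```

The semi-supervised part of SSEBI would additionally pseudo-label unlabeled samples before this bagging step; that is omitted here.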

9.
For online learning to be truly effective, new learning and assessment models are needed that encourage, support, and elicit positive psychological factors; research on learning orientations takes this as its starting point. After comparing learning-orientation research at home and abroad, this paper focuses on Margaret Martinez et al.'s definition and classification of learning orientations, the measurement instruments based on them, and the application of related research in teaching, and concludes with an outlook on trends in Martinez et al.'s learning-orientation research.

10.
Parkinson’s disease (PD) is a chronic neurodegenerative disease that predominantly affects the elderly. Diagnosing the early stages of PD calls for effective and powerful automated techniques built on recent enabling technologies. In this study, we present a comprehensive review of papers from 2013 to 2021 on the diagnosis of PD and its subtypes using artificial neural networks (ANNs) and deep neural networks (DNNs). We present detailed information and analysis regarding the usage of various modalities, datasets, architectures and experimental configurations in a succinct manner. We also present an in-depth comparative analysis of the various proposed architectures. Finally, we present a number of relevant future directions for researchers in this area.

11.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, the study on the same issue for ranking appears to be insufficient. As pointed out in this paper, it is inappropriate to extend technologies for classification to ranking, and the development of novel technologies is sorely needed. In this paper, we study the development of such technologies. To begin with, we propose the concept of “pairwise preference consistency” (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure on queries and documents, which are both unique properties of ranking. Then we select a subset of the original training documents, by maximizing the PPC of the selected subset. We further propose an efficient solution to the maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.
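The idea behind pairwise preference consistency can be illustrated by measuring how often score order agrees with relevance-label order within a query. A toy stand-in for the paper's PPC (the scores and labels are invented, and the real measure also accounts for the query/document hierarchy):

```python
from itertools import combinations

def pairwise_consistency(docs):
    """Fraction of within-query document pairs whose score order
    agrees with their relevance-label order."""
    agree = total = 0
    for (s1, r1), (s2, r2) in combinations(docs, 2):
        if r1 == r2:
            continue                      # ties carry no preference
        total += 1
        if (s1 - s2) * (r1 - r2) > 0:     # same ordering in both
            agree += 1
    return agree / total if total else 1.0

# Hypothetical (score, relevance_label) pairs for one query.
clean = [(0.9, 2), (0.7, 1), (0.2, 0)]
noisy = [(0.9, 0), (0.7, 1), (0.2, 2)]   # labels inverted

assert pairwise_consistency(clean) == 1.0
assert pairwise_consistency(noisy) == 0.0
```

Selecting the training subset would then amount to maximizing this consistency over the retained documents.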

12.
This paper presents a semantically rich document representation model for automatically classifying financial documents into predefined categories using deep learning. The model architecture consists of two main modules: document representation and document classification. In the first module, a document is enriched with semantics using background knowledge provided by an ontology and through the acquisition of its relevant terminology. Integrating the acquired terminology into the ontology extends semantically rich document representations with in-depth coverage of concepts, thereby capturing the whole conceptualization involved in the documents. The semantically rich representations obtained from the first module serve as input to the document classification module, which aims to find the most appropriate category for each document through deep learning. Three deep learning networks, each belonging to a different category of machine learning techniques, are used for ontological document classification with a real-life ontology. Multiple simulations are carried out with various deep neural network configurations, and our findings reveal that a three-hidden-layer feedforward network with 1024 neurons obtains the highest document classification performance on the INFUSE dataset. For the same network configuration, the F1 score is further increased by almost five percentage points, to 78.10%, when the relevant terminology integrated into the ontology is applied to enrich the document representation. Furthermore, we conducted a comparative performance evaluation using various state-of-the-art document representation approaches and classification techniques, including shallow and conventional machine learning classifiers.
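The ontology-based enrichment step can be sketched as expanding a document's bag of words with related concepts. A minimal illustration with a hypothetical toy ontology (not the INFUSE ontology, whose structure is much richer):

```python
def enrich(tokens, ontology):
    """Expand a document's bag of words with ontology-provided
    related concepts for each recognized term."""
    enriched = list(tokens)
    for tok in tokens:
        enriched.extend(ontology.get(tok, []))
    return enriched

# Hypothetical financial mini-ontology mapping terms to related concepts.
ontology = {"loan": ["credit", "debt"], "equity": ["stock", "share"]}
doc = ["loan", "approval", "equity"]
rep = enrich(doc, ontology)
```

The enriched token list, rather than the raw one, would then be vectorized and fed to the classification network.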

13.
Application of multi-source information fusion to extracting salinized-land information in arid regions
Soil salinization is one of the main environmental problems facing oasis stability and sustainable development in arid regions, so timely and accurate extraction of salinized-land information and its spatial distribution by remote sensing is of great practical significance. Taking the Weigan-Kuqa river delta oasis as an example, Radarsat SAR and Landsat TM images were fused by principal component analysis, and the result was quantitatively compared with the fusion effects of the HIS and Brovey transforms; a BP neural network model was then used to classify the pre- and post-fusion images with the same training samples. The results show that salinized land is mainly distributed in the ecotone between the oasis and the desert, in strips inside the oasis and in patches outside it, with severely salinized land interspersed among moderately and lightly salinized land outside the oasis. The principal-component fusion image preserves spectral information better and carries more information than the other common fusion methods, and its classification accuracy is considerably higher than that of the single Landsat TM multispectral image, making it an effective means of monitoring salinized-land change in arid regions.

14.
An idiom is a common phrase that means something other than its literal meaning. Detecting idioms automatically is a serious challenge in natural language processing (NLP) applications such as information retrieval (IR), machine translation and chatbots, and automatic idiom detection plays an important role in all of them. A fundamental NLP task is text classification, which categorizes text into structured categories, also known as text labeling or categorization. This paper treats idiom identification as a text classification task. Pre-trained deep learning models have been used for several text classification tasks, though models like BERT and RoBERTa have not been used exclusively for idiom versus literal classification. We propose a predictive ensemble model to classify idioms and literals using BERT and RoBERTa, fine-tuned on the TroFi dataset. The model is tested on a newly created in-house dataset of 1470 idioms and literal expressions annotated by domain experts. Our model outperforms the baseline models on the metrics considered, such as F-score and accuracy, with a 2% improvement in accuracy.
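The ensemble itself can be as simple as soft voting over the two models' predicted probabilities. A hedged sketch (the probabilities and threshold are illustrative; the paper's exact combination rule may differ):

```python
def ensemble_predict(p_bert, p_roberta, threshold=0.5):
    """Soft voting: average the two models' idiom probabilities and
    compare the result against a decision threshold."""
    avg = (p_bert + p_roberta) / 2
    return "idiom" if avg >= threshold else "literal"

# Hypothetical per-sentence probabilities from the two fine-tuned models.
assert ensemble_predict(0.9, 0.7) == "idiom"
assert ensemble_predict(0.4, 0.3) == "literal"
# A disagreement is resolved by the averaged confidence.
assert ensemble_predict(0.8, 0.3) == "idiom"
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one.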

15.
16.
Text classification is an important research topic in natural language processing (NLP), and Graph Neural Networks (GNNs) have recently been applied to this task. However, in existing graph-based models, text graphs constructed by rules are not real graph data and introduce massive noise. More importantly, with a fixed corpus-level graph structure, these models cannot sufficiently exploit the labeled and unlabeled information of nodes. Meanwhile, contrastive learning has been developed as an effective method in the graph domain to fully utilize node information. We therefore propose a new graph-based model for text classification, named CGA2TC, which introduces contrastive learning with an adaptive augmentation strategy to obtain more robust node representations. First, we explore word co-occurrence and document-word relationships to construct a text graph. Then, we design an adaptive augmentation strategy for the noisy text graph to generate two contrastive views that effectively address the noise problem while preserving essential structure. Specifically, we design noise-based and centrality-based augmentation strategies on the topological structure of the text graph to disturb unimportant connections and thus highlight the relatively important edges. For labeled nodes, we take nodes with the same label as multiple positive samples and assign them to the anchor node, while we employ consistency training on unlabeled nodes to constrain model predictions. Finally, to reduce the resource consumption of contrastive learning, we adopt a random sampling method to select nodes for computing the contrastive loss. Experimental results on several benchmark datasets demonstrate the effectiveness of CGA2TC on the text classification task.
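Centrality-based augmentation can be sketched as ranking edges by the degree centrality of their endpoints and dropping the least important ones. A toy version on a tiny word/document graph (the edge list and drop fraction are invented; the paper's strategy is more elaborate):

```python
def degree_centrality(edges):
    """Degree of each node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def centrality_augment(edges, drop_fraction=0.25):
    """Keep the edges whose combined endpoint centrality is highest,
    dropping the least important ones to form a contrastive view."""
    deg = degree_centrality(edges)
    ranked = sorted(edges, key=lambda e: deg[e[0]] + deg[e[1]],
                    reverse=True)
    keep = max(1, int(len(edges) * (1 - drop_fraction)))
    return ranked[:keep]

# Toy graph: "doc0" is a hub; the ("w3", "w4") edge is peripheral.
edges = [("doc0", "w1"), ("doc0", "w2"), ("doc0", "w3"), ("w3", "w4")]
view = centrality_augment(edges)
```

Running the augmentation twice with different perturbations (e.g., adding random edge dropping) would yield the two contrastive views.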

17.
Dictionary-based classifiers are an essential group of approaches in the field of time series classification. Their distinctive characteristic is that they transform time series into segments made of symbols (words) and then classify the time series using these words. Dictionary-based approaches are suitable for datasets containing time series of unequal length. The prevalence of dictionary-based methods inspired the research in this paper. We propose a new dictionary-based classifier called SAFE. The new approach transforms the raw numeric data into a symbolic representation using the Simple Symbolic Aggregate approXimation (SAX) method. We then partition the symbolic time series into a sequence of words and employ a word-embedding neural model known from Natural Language Processing to train the classification mechanism. The proposed scheme was applied to classify 30 benchmark datasets and compared with a range of state-of-the-art time series classifiers. The name SAFE comes from our observation that this method is safe to use. Empirical experiments have shown that SAFE gives excellent results: it is always in the top 5%–10% when the classification accuracy of state-of-the-art algorithms is ranked across various datasets. Our method ranks third among state-of-the-art dictionary-based approaches (after the WEASEL and BOSS methods).
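The SAX step that SAFE builds on (z-normalization, piecewise aggregation, Gaussian breakpoints) is standard enough to sketch. A minimal version for a four-letter alphabet (the input series is made up, and real SAX implementations handle frame boundaries and edge cases more carefully):

```python
from math import sqrt
from bisect import bisect

def sax(series, n_segments, breakpoints=(-0.6745, 0.0, 0.6745),
        alphabet="abcd"):
    """Minimal SAX: z-normalize, apply piecewise aggregate
    approximation (PAA), then map each segment mean to a symbol via
    Gaussian breakpoints (the defaults are the standard quartile
    breakpoints for a four-letter alphabet)."""
    n = len(series)
    mean = sum(series) / n
    std = sqrt(sum((x - mean) ** 2 for x in series) / n) or 1.0
    z = [(x - mean) / std for x in series]
    # PAA: average equal-width frames (here n must divide evenly)
    width = n // n_segments
    paa = [sum(z[i * width:(i + 1) * width]) / width
           for i in range(n_segments)]
    return "".join(alphabet[bisect(breakpoints, v)] for v in paa)

word = sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4)
```

The resulting symbolic word ("aadd" for this low-then-high series) is what SAFE's word-embedding stage would consume.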

18.
With the rapid development of the Internet, the number of web pages has grown explosively, in recent years at an exponential rate, and search engines face increasingly severe challenges in accurately and quickly finding pages that meet users' needs within this enormous volume. Web page classification is one effective means of addressing the problem; topic-based and genre-based classification are its two main strands, and both have effectively improved the retrieval efficiency of search engines. Genre-based classification categorizes web pages by their form of presentation and intended use. This paper introduces the definition of web genre, the features commonly used in web genre classification research, several common feature selection methods, classification models, and classifier evaluation methods, providing researchers with an overview of web genre classification.

19.
Stance detection identifies a person’s evaluation of a subject, and is a crucial component for many downstream applications. In application, stance detection requires training a machine learning model on an annotated dataset and applying the model on another to predict stances of text snippets. This cross-dataset model generalization poses three central questions, which we investigate using stance classification models on 7 publicly available English Twitter datasets ranging from 297 to 48,284 instances. (1) Are stance classification models generalizable across datasets? We construct a single dataset model to train/test dataset-against-dataset, finding models do not generalize well (avg F1=0.33). (2) Can we improve the generalizability by aggregating datasets? We find a multi dataset model built on the aggregation of datasets has an improved performance (avg F1=0.69). (3) Given a model built on multiple datasets, how much additional data is required to fine-tune it? We find it challenging to ascertain a minimum number of data points due to the lack of pattern in performance. Investigating possible reasons for the choppy model performance we find that texts are not easily differentiable by stances, nor are annotations consistent within and across datasets. Our observations emphasize the need for an aggregated dataset as well as consistent labels for the generalizability of models.
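The averaged F1 scores above rest on a per-class (macro) F1 computation, which can be sketched as follows (the stance labels and predictions are invented, not drawn from the paper's datasets):

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1: per-class F1 scores averaged with equal
    weight, so rare stances count as much as frequent ones."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Hypothetical stance predictions on a held-out dataset.
y_true = ["favor", "against", "none", "favor", "against"]
y_pred = ["favor", "none", "none", "against", "against"]
f1 = macro_f1(y_true, y_pred, ["favor", "against", "none"])
```

In the cross-dataset protocol, this score would be computed for every ordered (train, test) dataset pair and then averaged.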

20.
Semi-supervised document retrieval
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to use the advantages of both the traditional Information Retrieval (IR) methods and the supervised learning methods for IR proposed recently. The advantages include the use of limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage the use of unlabeled data in learning.
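The data-labeling step, scoring unlabeled documents with BM25 and thresholding, can be sketched with a toy corpus (the tokenized documents, query, and threshold are all invented, and the paper's theory-based stopping criterion is omitted):

```python
from math import log

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += (idf * tf * (k1 + 1)
                  / (tf + k1 * (1 - b + b * len(doc) / avgdl)))
    return score

def pseudo_label(query, unlabeled, corpus, threshold):
    """Mark unlabeled docs as relevant when BM25 clears a threshold,
    a toy stand-in for SSRank's data-labeling step."""
    return [(doc, bm25_score(query, doc, corpus) >= threshold)
            for doc in unlabeled]

corpus = [["ranking", "model", "ir"],
          ["neural", "network", "training"],
          ["cat", "videos"]]
labels = pseudo_label(["ranking", "ir"], corpus, corpus, threshold=1.0)
```

The pseudo-labeled pairs would then join the human-labeled data to train the neural ranking model.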


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号