Similar Documents
20 similar documents found (search time: 15 ms)
1.
A main challenge of Named Entity Recognition (NER) for tweets is the insufficient information in a single tweet, owing to the noisy and short nature of tweets. We propose a novel system to tackle this challenge, which leverages redundancy in tweets by conducting two-stage NER over multiple similar tweets. It first pre-labels each tweet using a sequential labeler based on the linear-chain Conditional Random Fields (CRF) model. It then clusters tweets so that tweets with similar content fall into the same group. Finally, for each cluster it refines the labels of each tweet using an enhanced CRF model that incorporates cluster-level information, i.e., the labels of the current word and its neighboring words across all tweets in the cluster. We evaluate our method on a manually annotated dataset and show that it boosts the F1 of a baseline without collective labeling from 75.4% to 82.5%.
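The two-stage idea can be sketched as follows. This is a minimal illustration only: the clustering uses greedy Jaccard grouping and the refinement stage is replaced by a simple majority vote over each word's pre-labels, standing in for the paper's enhanced cluster-level CRF; all thresholds and tag values are illustrative assumptions.

```python
from collections import Counter

def jaccard(a, b):
    """Token-set Jaccard similarity between two tokenized tweets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def cluster_tweets(tweets, threshold=0.4):
    """Greedy single-pass clustering: a tweet joins the first cluster
    whose seed tweet is similar enough, otherwise it starts a new one."""
    clusters = []
    for toks in tweets:
        for c in clusters:
            if jaccard(toks, c[0]) >= threshold:
                c.append(toks)
                break
        else:
            clusters.append([toks])
    return clusters

def refine_labels(tweets, pre_tags):
    """Second stage (stand-in for the enhanced CRF): relabel each token
    with the majority label its word received across the whole cluster."""
    votes = {}
    for toks, tags in zip(tweets, pre_tags):
        for w, t in zip(toks, tags):
            votes.setdefault(w, Counter())[t] += 1
    return [[votes[w].most_common(1)[0][0] for w in toks] for toks in tweets]
```

In the sketch, a tweet whose pre-labeler missed "obama" as a person gets corrected because the word was labeled B-PER in the other tweets of its cluster.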

2.
Climate change has become one of the most significant crises of our time. Public opinion on climate change is shaped on social media platforms such as Twitter, where it is often divided into believers and deniers. In this paper, we propose a framework to classify a tweet's stance on climate change (denier/believer). Existing approaches to stance detection and classification of climate change tweets either pay little attention to the characteristics of deniers' tweets or lack an appropriate architecture. The relevant literature reveals that the sentiment-related aspects and time perspective of climate change conversations on Twitter have a major impact on public attitudes and environmental orientation. Therefore, our study explores the role of temporal orientation and sentiment analysis (auxiliary tasks) in detecting the stance of tweets on climate change (main task). Our proposed framework STASY integrates word- and sentence-based feature encoders with intra-task and shared-private attention to better encode the interactions between task-specific and shared features. We conducted experiments on our novel curated climate change CLiCS dataset (2465 denier and 7235 believer tweets), two publicly available climate change datasets (ClimateICWSM-2022 and ClimateStance-2022), and two benchmark stance detection datasets (SemEval-2016 and COVID-19-Stance). Experiments show that our approach improves stance detection performance over baseline methods by benefiting from the auxiliary tasks, with average F1 improvements of 12.14% on our climate change dataset, 15.18% on ClimateICWSM-2022, 12.94% on ClimateStance-2022, 19.38% on SemEval-2016, and 35.01% on COVID-19-Stance.

3.
Discriminative sentence compression with conditional random fields   (cited 2 times: 0 self-citations, 2 by others)
This paper focuses on a particular approach to automatic sentence compression that makes use of a discriminative sequence classifier known as Conditional Random Fields (CRF). We devise several features for the CRF that allow it to incorporate information on nonlinear relations among words. We also address the issue of data paucity by collecting data from RSS feeds available on the Internet and turning them into training data for the CRF, drawing on techniques from biology and information retrieval. We further discuss a recursive application of the CRF to the syntactic structure of a sentence as a way of improving the readability of the compressions it generates. Experiments show that our approach works reasonably well compared to the state-of-the-art system [Knight, K., & Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139, 91–107].

4.
The content generation strategy of a sports franchise determines whether user engagement increases or decreases on social media platforms. Thus, the role of the Chief Operating Officer (COO), who generally decides and governs a franchise's social media policies, is profound. We show that the cultural differences between sports franchises governed by local COOs and those governed by foreign COOs are reflected in their content generation strategies and are also associated with user engagement. We use Hofstede's cultural dimensions theory and extract relevant features from the tweets. Overall, the results show that user engagement is higher when the content generation strategy aligns with the fans' national culture. The first contribution of our work is to show the incremental impact of power distance, individualism, and collectivism on user engagement. The second is feature construction, feature selection, and the building of authorship attribution classifiers to understand content generation strategy. Prior literature shows that national culture affects the writing of online reviews; we extend this literature by investigating the role of national culture in social media content generation and user engagement. Our study helps organizations understand the role of national culture in content generation and how it relates to user engagement.

5.
Stance detection distinguishes whether a text's author supports, opposes, or maintains a neutral stance towards a given target. In most real-world scenarios, stance detection must work in a zero-shot manner, i.e., predicting stances for unseen targets without labeled data. One critical challenge of zero-shot stance detection is the absence of contextual information about the targets. Current works mostly concentrate on introducing external knowledge to supplement information about targets, but the noisy schema-linking process hinders their performance in practice. To combat this issue, we argue that previous studies have ignored the extensive target-related information contained in the unlabeled data available during the training phase, and we propose a simple yet efficient Multi-Perspective Contrastive Learning Framework for zero-shot stance detection. Our framework leverages information not only from labeled data but also from extensive unlabeled data. To this end, we design target-oriented contrastive learning and label-oriented contrastive learning to capture a more comprehensive target representation and more distinguishable stance features. We conduct extensive experiments on three widely adopted datasets (ranging from 4870 to 33,090 instances): SemEval-2016, WT-WT, and VAST. Our framework achieves macro-average F1 scores of 53.6%, 77.1%, and 72.4% on these datasets, showing improvements of 2.71% and 0.25% over state-of-the-art baselines on SemEval-2016 and WT-WT, respectively, and comparable results on the more challenging VAST dataset.
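Both the target-oriented and label-oriented objectives in this line of work are typically variants of a generic contrastive (InfoNCE-style) loss; the abstract does not give the exact formulation, so the sketch below shows only the generic single-positive form, with the embeddings, temperature, and similarity function as illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: pull the anchor toward its positive and
    push it away from the negatives. Computed with the usual max-shift
    for numerical stability."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

When the positive already sits close to the anchor the loss is near zero; when a negative sits closer, the loss grows, which is the gradient signal that separates stance features.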

6.
With the onset of COVID-19, the pandemic sparked extensive discussion on social media platforms like Twitter, followed by many social media analyses. Despite the abundance of such studies, little work has examined reactions from the public and officials on social networks and their associations, especially during the early outbreak stage. In this paper, a total of 9,259,861 COVID-19-related English tweets published from 31 December 2019 to 11 March 2020 are collected to explore the participatory dynamics of public attention and news coverage during the early stage of the pandemic. An easy numeric data augmentation (ENDA) technique is proposed for generating new samples while preserving label validity. It attains superior performance on text classification tasks with deep models (BERT) compared to an easy data augmentation baseline. To further demonstrate the efficacy of ENDA, experiments and ablation studies were also conducted on other benchmark datasets. The classification results of COVID-19 tweets show tweet peaks triggered by momentous events and a strong positive correlation between the daily number of personal narratives and news reports. We argue that there were three periods divided by the turning points on January 20 and February 23, and that the low level of news coverage suggests missed windows for government response in early January and February. Our study not only contributes to a deeper understanding of the dynamic patterns and relationships between public attention and news coverage on social media during the pandemic but also sheds light on early emergency management and government response on social media during global health crises.
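The abstract does not spell out ENDA's operations, so the sketch below shows only the general shape of label-preserving token-level augmentation in the spirit of easy data augmentation (random swap and random deletion); the operation mix, probabilities, and seeds are illustrative assumptions, not the paper's method.

```python
import random

def random_swap(tokens, n_swaps=1, rng=None):
    """Swap n_swaps random token pairs; meaning-light perturbation."""
    rng = rng or random.Random(0)
    toks = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(toks)), rng.randrange(len(toks))
        toks[i], toks[j] = toks[j], toks[i]
    return toks

def random_deletion(tokens, p=0.1, rng=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(list(tokens))]

def augment(tokens, label, n_new=4):
    """Generate n_new perturbed copies of a sample; the label is
    carried over unchanged, which is the label-validity property."""
    rng = random.Random(42)
    out = []
    for _ in range(n_new):
        op = rng.choice([random_swap, random_deletion])
        out.append((op(tokens, rng=rng), label))
    return out
```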

7.
Named entity recognition (NER) is mostly formalized as a sequence labeling problem in which segments of named entities are represented by label sequences. Although considerable effort has been made to investigate sophisticated features that encode textual characteristics of named entities (e.g., PEOPLE, LOCATION, etc.), little attention has been paid to segment representations (SRs) for multi-token named entities (e.g., the IOB2 notation). In this paper, we investigate the effects of different SRs on NER tasks and propose a feature generation method using multiple SRs. The proposed method allows a model to exploit not only the highly discriminative features of complex SRs but also the robustness of simple SRs against data sparseness. Since it incorporates different SRs as feature functions of Conditional Random Fields (CRFs), we can use the well-established training procedure. In addition, the tagging speed of a model integrating multiple SRs can be accelerated to match that of a model using only the most complex SR in the ensemble. Experimental results demonstrate that incorporating multiple SRs into a single model improves both the performance and the stability of NER. We also provide a detailed analysis of the results.
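To make the notion of segment representations concrete, here is a standard conversion from IOB2 to the richer IOBES scheme, where single-token entities get an S- tag and entity-final tokens an E- tag. The conversion rule itself is standard; its use here as a stand-alone sketch is only illustrative of how one SR can be derived from another.

```python
def iob2_to_iobes(tags):
    """Convert an IOB2 tag sequence to the IOBES segment representation.
    B- becomes S- for single-token entities; I- becomes E- at entity ends."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype       # does the entity go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + etype)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + etype)
    return out
```

A model can then receive both the IOB2 and the IOBES tags of the same segment as parallel feature functions, which is the core of the multiple-SR idea.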

8.
Stance detection identifies a person's evaluation of a subject and is a crucial component of many downstream applications. In practice, stance detection requires training a machine learning model on an annotated dataset and applying the model to another to predict the stances of text snippets. This cross-dataset model generalization poses three central questions, which we investigate using stance classification models on 7 publicly available English Twitter datasets ranging from 297 to 48,284 instances. (1) Are stance classification models generalizable across datasets? We train single-dataset models and test each dataset against every other, finding that models do not generalize well (avg F1=0.33). (2) Can we improve generalizability by aggregating datasets? We find that a multi-dataset model built on the aggregation of datasets performs better (avg F1=0.69). (3) Given a model built on multiple datasets, how much additional data is required to fine-tune it? We find it difficult to ascertain a minimum number of data points due to the lack of a clear pattern in performance. Investigating possible reasons for the uneven model performance, we find that texts are not easily differentiable by stance, nor are annotations consistent within and across datasets. Our observations emphasize the need for an aggregated dataset as well as consistent labels for model generalizability.
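The dataset-against-dataset protocol in question (1) can be sketched as a train/evaluate matrix. The harness below is generic; the toy majority-class trainer and accuracy metric used in the example are stand-ins for the paper's actual classifiers and F1 scoring.

```python
def cross_dataset_matrix(datasets, train_fn, eval_fn):
    """Train a model on each dataset and evaluate it on every dataset.
    Off-diagonal cells expose how poorly single-dataset models transfer."""
    names = sorted(datasets)
    scores = {}
    for tr in names:
        model = train_fn(datasets[tr])
        for te in names:
            scores[(tr, te)] = eval_fn(model, datasets[te])
    return scores
```

With real classifiers, the diagonal cells correspond to in-dataset performance and the off-diagonal average corresponds to the reported cross-dataset figure.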

9.
Social media platforms such as Twitter provide convenient ways to share and consume important information during disasters and emergencies. Information from bystanders and eyewitnesses can help law enforcement agencies and humanitarian organizations obtain firsthand, credible information about an ongoing situation and gain situational awareness, among other potential uses. However, identifying eyewitness reports on Twitter is a challenging task. This work investigates different types of sources in tweets related to eyewitnesses and classifies them into three types: (i) direct eyewitnesses, (ii) indirect eyewitnesses, and (iii) vulnerable eyewitnesses. Moreover, we investigate the characteristics associated with each eyewitness type. We observe that words related to perceptual senses (feeling, seeing, hearing) tend to be present in direct eyewitness messages, whereas emotions, thoughts, and prayers are more common in indirect-witness messages. We use these characteristics and labeled data to train several machine learning classifiers. Our results on several real-world Twitter datasets reveal that textual features (bag-of-words) achieve better classification performance when combined with domain-expert features. Our approach contributes a successful example of combining crowdsourced and machine learning analysis, and increases our understanding of, and capability for, identifying valuable eyewitness reports during disasters.
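A feature extractor combining bag-of-words with domain-expert cues might look like the sketch below. The word lists are tiny hypothetical stand-ins for the perceptual-sense and emotion lexica the study alludes to, not the actual resources used.

```python
# Hypothetical miniature lexica; real systems would use curated lists
# such as LIWC perceptual-process categories.
PERCEPTUAL = {"see", "saw", "seeing", "hear", "heard", "feel", "felt"}
EMOTION = {"pray", "prayers", "hope", "thoughts", "scared", "sad"}

def eyewitness_features(tokens):
    """Bag-of-words features plus two domain-expert counts: perceptual
    words (a direct-eyewitness cue) and emotion/prayer words (an
    indirect-witness cue)."""
    toks = [t.lower() for t in tokens]
    feats = {"bow=" + t: 1 for t in toks}
    feats["perceptual"] = sum(t in PERCEPTUAL for t in toks)
    feats["emotion"] = sum(t in EMOTION for t in toks)
    return feats
```

The resulting dictionaries can be fed to any standard classifier via a feature hasher or vectorizer.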

10.
Stance is defined as the expression of a speaker's standpoint towards a given target or entity. To date, the most reliable method for measuring stance is the opinion survey. However, people's increased reliance on social media makes these online platforms an essential source of complementary information about public opinion. Our study contributes to the discussion of replicable methods for reliable stance detection by establishing a rule-based model, which we replicated for several targets independently. To test our model, we relied on a widely used dataset of annotated tweets, the SemEval Task 6A dataset, which contains 5 targets with 4,163 manually labelled tweets. We relied on off-the-shelf sentiment lexica to expand the scope of our custom dictionaries, while also integrating linguistic markers and using word-pair dependency information to conduct stance classification. While positive and negative evaluative words are the clearest markers of the expression of stance, we demonstrate the added value of linguistic markers for identifying the direction of the stance more precisely. Our model achieves an average classification accuracy of 75% (ranging from 67% to 89% across targets). We conclude by discussing practical implications and outlooks for future research, highlighting that each target poses specific challenges to stance detection.
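A toy version of such a rule-based classifier is sketched below. The lexica are hypothetical three-word stand-ins, and the fixed-window negation flip is a crude substitute for the paper's dependency-based word pairs; it only illustrates why direction markers matter beyond raw polarity counts.

```python
# Hypothetical miniature lexica; a real system would use full
# sentiment dictionaries plus target-specific custom entries.
POSITIVE = {"great", "support", "love", "agree"}
NEGATIVE = {"bad", "oppose", "hate", "disagree"}
NEGATORS = {"not", "no", "never", "don't"}

def rule_based_stance(tokens, flip_window=2):
    """Sum evaluative-word polarities; flip a word's polarity when a
    negator appears within the preceding flip_window tokens."""
    toks = [t.lower() for t in tokens]
    score = 0
    for i, t in enumerate(toks):
        pol = 1 if t in POSITIVE else -1 if t in NEGATIVE else 0
        if pol and any(w in NEGATORS for w in toks[max(0, i - flip_window):i]):
            pol = -pol
        score += pol
    return "FAVOR" if score > 0 else "AGAINST" if score < 0 else "NONE"
```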

11.
Users’ ability to retweet information has made Twitter one of the most prominent social media platforms for disseminating emergency information during disasters. However, few studies have examined how Twitter’s features can support the different communication patterns that occur during different phases of disaster events. Based on the disaster communication literature and Media Synchronicity Theory, we identify distinct disaster phases and the two communication types, crisis communication and risk communication, that occur during those phases. We investigate how Twitter’s representational features, including words, URLs, hashtags, and hashtag importance, influence the average retweet time (the average time it takes for a retweet to occur), as well as how such effects differ by type of disaster communication. Our analysis of tweets from the 2013 Colorado floods found that adding more URLs to tweets increases the average retweet time more in risk-related tweets than in crisis-related tweets. Further, including key disaster-related hashtags contributed to faster retweets in crisis-related tweets than in risk-related tweets. Our findings suggest that the influence of Twitter’s media capabilities on rapid tweet propagation during disasters may differ based on the communication process.

12.
In contrast to traditional opinion mining research in the product domain, this paper presents an exploratory study of each component of opinion mining in the general Chinese-language domain. A conditional random field model based on multiple linguistic features and candidate opinion targets is used to extract opinion expressions. We improve on the window-constrained nearest-neighbor method and propose a matching algorithm for opinion-target/opinion-expression pairs, which also further refines the extraction of opinion targets.

13.
Much of the valuable information supporting decision-making processes originates in text-based documents. Although these documents can be effectively searched and ranked by modern search engines, actionable knowledge needs to be extracted and transformed into a structured form before being used in a decision process. In this paper we describe how the discovery of semantic information embedded in natural language documents can be viewed as an optimization problem aimed at assigning a sequence of labels (hidden states) to a set of interdependent variables (textual tokens). Dependencies among variables are efficiently modeled through Conditional Random Fields, an undirected graphical model able to represent the distribution of labels given a set of observations. The Markov property of these models prevents them from taking into account long-range dependencies among variables, which are indeed relevant in Natural Language Processing. To overcome this limitation, we propose an inference method based on an Integer Programming formulation of the problem, where long-distance dependencies are included through non-deterministic soft constraints.
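The effect of a long-range soft constraint on decoding can be illustrated without an ILP solver. The sketch below brute-forces all label sequences for a toy input and subtracts a penalty whenever two occurrences of the same token receive different labels, a dependency a first-order Markov chain cannot express; the label set, scores, and penalty are illustrative assumptions, and real systems solve this with Integer Programming rather than enumeration.

```python
from itertools import product

LABELS = ["O", "ENT"]  # toy label set

def decode_with_soft_constraints(tokens, local_score, penalty=1.0):
    """Pick the label sequence maximizing the sum of local (CRF-style)
    scores minus a soft penalty for each pair of identical tokens that
    receive different labels. Brute force; toy-scale only."""
    best, best_score = None, float("-inf")
    for seq in product(LABELS, repeat=len(tokens)):
        score = sum(local_score(i, t, l)
                    for i, (t, l) in enumerate(zip(tokens, seq)))
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens)):
                if tokens[i] == tokens[j] and seq[i] != seq[j]:
                    score -= penalty
        if score > best_score:
            best, best_score = list(seq), score
    return best
```

In the test below, the second "acme" locally prefers O, but the consistency penalty pulls it to ENT because the first occurrence is a confident entity.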

14.
Health misinformation has become an unfortunate fixture of social media platforms, where lies can spread faster than truth. Despite considerable work devoted to suppressing fake news, health misinformation, including low-quality health news, has persisted and even increased in recent years. One promising approach to fighting bad information is studying the temporal and sentiment effects of health news stories and how they are discussed and disseminated on social media platforms like Twitter. As part of the effort to find innovative ways to fight health misinformation, this study analyzes a dataset of more than 1600 objectively and independently reviewed health news stories published over a 10-year span and nearly 50,000 Twitter posts responding to them. Specifically, it examines the source credibility of health news circulated on Twitter and the temporal and sentiment features of the tweets containing or responding to the health news reports. The results show that health news stories rated low by experts are discussed more, persist longer, and produce stronger sentiments than highly rated ones in the tweetosphere. However, the highly rated stories retained fresh interest, in the form of new tweets, for a longer period. An in-depth understanding of the characteristics of health news distribution and discussion is the first step toward mitigating the surge of health misinformation. The findings provide insights into the mechanism of health information dissemination on social media and practical implications for fighting and mitigating health misinformation on digital media platforms.

15.
In the context of social media, users usually post information relevant to the contents of events mentioned in a Web document. This information has two important properties: (i) it reflects the content of an event and (ii) it shares hidden topics with sentences in the main document. In this paper, we present a novel model to capture the nature of the relationships between document sentences and post information (comments or tweets) in sharing hidden topics for the summarization of Web documents by utilizing relevant post information. Unlike previous methods, which are usually based on hand-crafted features, our approach ranks document sentences and user posts based on their importance to the topics. The sentence-user-post relation is formulated in a shared topic matrix, which represents their mutual reinforcement. Our proposed matrix co-factorization algorithm computes the score of each document sentence and user post and extracts the top-ranked document sentences and comments (or tweets) as a summary. We apply the model to the task of social context summarization on three datasets in two languages, English and Vietnamese, and also on DUC 2004 (a standard corpus for the traditional summarization task). According to the experimental results, our model significantly outperforms basic matrix factorization and achieves ROUGE scores competitive with state-of-the-art methods.
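The mutual-reinforcement idea can be approximated with a HITS-style iteration over the shared topic matrix; this sketch stands in for the paper's matrix co-factorization, which it is not, and the matrix entries are illustrative assumptions (S[i][j] > 0 when sentence i and post j share topical content).

```python
def mutual_reinforcement_rank(S, iters=20):
    """HITS-style mutual reinforcement over a sentence-by-post shared
    topic matrix S: a sentence scores high if it shares topics with
    high-scoring posts, and vice versa. Returns (sentence_scores,
    post_scores), each L2-normalized."""
    n, m = len(S), len(S[0])
    sent = [1.0] * n
    post = [1.0] * m
    for _ in range(iters):
        post = [sum(S[i][j] * sent[i] for i in range(n)) for j in range(m)]
        sent = [sum(S[i][j] * post[j] for j in range(m)) for i in range(n)]
        for v in (sent, post):
            norm = sum(x * x for x in v) ** 0.5 or 1.0
            for k in range(len(v)):
                v[k] /= norm
    return sent, post
```

The top-ranked sentences and posts would then form the summary, mirroring the extraction step described in the abstract.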

16.
This article describes in-depth research on machine learning methods for sentiment analysis of Czech social media. Whereas in English, Chinese, or Spanish this field has a long history and evaluation datasets for various domains are widely available, no systematic research has yet been conducted for Czech. We tackle this issue and establish a common ground for further research by providing a large human-annotated Czech social media corpus. Furthermore, we evaluate state-of-the-art supervised machine learning methods for sentiment analysis. We explore different pre-processing techniques and employ various features and classifiers. We also experiment with five different feature selection algorithms and investigate the influence of named entity recognition and preprocessing on sentiment classification performance. Moreover, in addition to our newly created social media dataset, we also report results for other popular domains, such as movie and product reviews. We believe that this article will not only extend current sentiment analysis research to another family of languages but will also encourage competition, potentially leading to the production of high-end commercial solutions.

17.
Rapid communication during extreme events is one of the critical aspects of successful disaster management strategies. Due to their ubiquitous nature, social media platforms offer a unique opportunity for crisis communication. In this study, about 52.5 million tweets related to Hurricane Sandy posted by 13.75 million users are analyzed to assess the effectiveness of social media communication during disasters and identify the factors contributing to effective crisis communication strategies. The efficiency of a social media user is defined as the ratio of attention gained to the number of tweets posted. A model is developed to identify more efficient users based on several relevant features. Results indicate that during a disaster event, only a few social media users become highly efficient in gaining attention. In addition, efficiency does not depend on the frequency of tweeting alone; it also depends on the number of followers and friends, user category, bot score (whether the account is controlled by a human or a machine), and activity patterns (the predictability of activity frequency). Since the proposed efficiency metric is easy to evaluate, it can potentially detect effective social media users in real time to communicate information and awareness to vulnerable communities during a disaster.
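Because the metric is just a ratio, it is simple to compute and rank in a stream. The sketch below follows the stated definition; what counts as "attention" (retweets, likes, mentions) is an assumption left to the caller.

```python
def user_efficiency(attention, n_tweets):
    """Efficiency = attention gained over number of tweets posted,
    following the definition in the study; 0 for users with no tweets."""
    return attention / n_tweets if n_tweets else 0.0

def top_efficient_users(stats, k=2):
    """stats: {user: (attention, n_tweets)}. Returns the k most
    efficient users, candidates for real-time information relays."""
    ranked = sorted(stats, key=lambda u: user_efficiency(*stats[u]),
                    reverse=True)
    return ranked[:k]
```

Note that ranking by efficiency rather than raw attention demotes high-volume accounts whose many tweets each gain little traction.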

18.
The research field of crisis informatics examines, amongst other things, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant ones, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process, and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis, and relevance classification; (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies; (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating social media metadata in a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision, and 80.4%/87.5% recall, with fast training using feature subset selection, on the European floods and BASF SE incident datasets); and (4) an approach to, and preliminary evaluation of, relevance classification using active, incremental, and online learning to reduce the amount of labeled data required and to correct the algorithm's misclassifications through feedback. Using the latter approach, we achieved a well-performing classifier on the European floods dataset while requiring only a quarter of the labeled data needed by the traditional batch learning approach. Despite a smaller effect on the BASF SE incident dataset, a substantial improvement was still observed.
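The accuracy/precision/recall triple quoted above can be computed from simple confusion counts; this helper is generic bookkeeping, not part of the system described, and the "relevant" label name is an assumption.

```python
def relevance_metrics(y_true, y_pred, positive="relevant"):
    """Accuracy, precision, and recall for binary relevance
    classification, computed from true/false positive counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

The high precision but lower recall reported for the European floods dataset means the classifier rarely surfaces irrelevant messages but misses some relevant ones, a reasonable trade-off for overload mitigation.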

19.
This paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition that takes into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short-text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674), or neither (1,223), with a Fleiss’ kappa score of 0.725. Results show that the self-training method with an SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. A combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora, with topic quality measured in terms of trustworthiness and the validation index.
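The self-training loop at the heart of objective (2) has a simple generic shape: train on the labeled pool, promote confidently predicted unlabeled items into it, repeat. The sketch below abstracts the paper's SVM base estimator behind caller-supplied fit/predict functions; the threshold and round count are illustrative assumptions.

```python
def self_train(labeled, unlabeled, fit, predict_proba,
               threshold=0.9, rounds=5):
    """Generic self-training loop. `fit(labeled)` returns a model;
    `predict_proba(model, x)` returns (label, confidence). Unlabeled
    items predicted with confidence >= threshold are promoted to the
    training set; the loop stops when nothing new is promoted."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        model = fit(labeled)
        keep = []
        for x in pool:
            label, conf = predict_proba(model, x)
            if conf >= threshold:
                labeled.append((x, label))
            else:
                keep.append(x)
        if len(keep) == len(pool):  # no promotions this round
            break
        pool = keep
    return fit(labeled), labeled
```

This is how 2,801 manual annotations can be grown into an 11,204-tweet corpus: the model's own confident predictions supply the additional labels.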

20.
Social media has become the most popular platform for free speech. This freedom has given the oppressed opportunities to raise their voice against injustices, but it has also led to a disturbing trend of spreading hateful content of various kinds. Pakistan has been dealing with sectarian and ethnic violence for the last three decades, and there is now a growing trend of disturbing content about religion, sect, and ethnicity on social media. This necessitates an automated system for the detection of controversial content on social media in Urdu, the national language of Pakistan. The biggest hurdle that has thwarted Urdu language processing is the scarcity of language resources, annotated datasets, and pretrained language models. In this study, we address the problem of detecting interfaith, sectarian, and ethnic hatred on social media in Urdu using machine learning and deep learning techniques. In particular, we have: (1) developed and presented guidelines for annotating Urdu text with appropriate labels at two levels of classification; (2) developed a large dataset of 21,759 tweets using these guidelines and made it publicly available; and (3) conducted experiments comparing the performance of eight supervised machine learning and deep learning techniques for the automated identification of hateful content. In the first step, hateful content detection is performed as a binary classification task; in the second step, the classification of interfaith, sectarian, and ethnic hatred is performed as a multiclass classification task. Overall, Bidirectional Encoder Representations from Transformers (BERT) proved to be the most effective technique for identifying hateful content in Urdu tweets.
