Similar Literature
20 similar documents retrieved.
1.
While geographical metadata referring to the originating locations of tweets provides valuable information for effective spatial analysis in social networks, the scarcity of such geotagged tweets limits their usability. In this work, we propose a content-based location prediction method for tweets that analyzes the geographical distribution of tweet texts using Kernel Density Estimation (KDE). The primary novelty of our work is to determine different kernel function settings for every term in tweets based on the location indicativeness of these terms. Our proposed method, which we call locality-adapted KDE, uses information-theoretic metrics and requires no parameter tuning for these settings. As a further enhancement of the term-level distribution model, we describe an analysis of spatial point patterns in tweet texts that identifies bigrams exhibiting significant deviation from the underlying unigram patterns. We present an expansion of the feature space using the selected bigrams and show that it yields a further improvement in the prediction accuracy of our locality-adapted KDE. We demonstrate that this expansion results in only a limited increase in the size of the feature space and does not hinder online localization of tweets. The methods we propose rely purely on statistical approaches and require no language-specific settings. Experiments conducted on three tweet sets from different countries show that our proposed solution outperforms existing state-of-the-art techniques, yielding significantly more accurate predictions.
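As a rough illustration of term-level KDE scoring (not the paper's locality-adapted bandwidth selection, which relies on information-theoretic metrics), the sketch below fits one two-dimensional KDE per term over the coordinates of a toy training set and sums term densities at candidate locations. The corpus, coordinates, and scipy's default bandwidth are all illustrative assumptions.

```python
# A minimal sketch of term-level KDE location scoring on toy (text, lat, lon)
# tweets. The locality-adapted bandwidth of the paper is not reproduced here;
# scipy's default (Scott's rule) is used instead.
from collections import defaultdict
import numpy as np
from scipy.stats import gaussian_kde

train = [
    ("midtown subway delays", 40.75, -73.99),
    ("lunch in midtown today", 40.76, -73.98),
    ("midtown traffic is wild", 40.755, -74.00),
    ("fog over the bridge", 37.77, -122.42),
    ("morning fog run", 37.81, -122.48),
    ("fog rolling in again", 37.78, -122.41),
]

# Collect coordinates per term and fit one 2-D KDE per term.
coords = defaultdict(list)
for text, lat, lon in train:
    for term in set(text.split()):
        coords[term].append((lat, lon))

kdes = {}
for term, pts in coords.items():
    pts = np.array(pts).T            # shape (2, n)
    if pts.shape[1] >= 3:            # KDE needs a few non-collinear points
        kdes[term] = gaussian_kde(pts)

def score(tweet, candidates):
    """Sum per-term densities at each candidate (lat, lon); higher is better."""
    cand = np.array(candidates).T    # shape (2, n_candidates)
    total = np.zeros(cand.shape[1])
    for term in set(tweet.split()):
        if term in kdes:
            total += kdes[term](cand)
    return total

print(score("stuck in midtown", [(40.75, -73.99), (37.78, -122.44)]))
```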

2.
The widespread popularity and worldwide adoption of social networks have raised interest in analyzing the content created on them. One such application is identifying the locations of political and social events, natural disasters, and similar occurrences from networks such as Twitter. The present study focuses on the localization of traffic accidents. Outdated and inaccurate information in user profiles, the absence of location data in tweet texts, and the limited number of geotagged posts are among the challenges that location estimation must address. Adopting the Dempster–Shafer theory of evidence, the present study estimates the location of accidents by combining user profiles, tweet texts, and the place attachments in tweets. The results indicate improved performance in terms of error distance and average error distance compared to previously developed methods, with the proposed method reducing the error distance by 26%.
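To make the evidence-combination step concrete, here is a small sketch of Dempster's rule of combination for two sources (for example, user profile versus tweet text) over a toy frame of candidate districts. The mass assignments and district names are illustrative, not the paper's.

```python
# A hedged sketch of Dempster's rule of combination for two mass functions
# whose focal elements are frozensets of candidate locations.
from itertools import product

def combine(m1, m2):
    """Combine two mass functions via Dempster's rule, normalizing conflict."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    k = 1.0 - conflict                 # normalization constant
    return {s: v / k for s, v in combined.items()}

D1, D2 = frozenset({"district1"}), frozenset({"district2"})
BOTH = D1 | D2                          # total uncertainty over the frame
profile_mass = {D1: 0.6, BOTH: 0.4}     # evidence from the user profile
text_mass = {D1: 0.3, D2: 0.5, BOTH: 0.2}  # evidence from the tweet text
print(combine(profile_mass, text_mass)) # district1 ends up with mass 0.6
```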

3.
Politicians’ tweets can have important political and economic implications. However, limited context makes it hard for readers to instantly and precisely understand them, especially from a causal perspective. The triggers for these tweets may have been reported in news prior to the tweets, but simply finding similar news articles would not serve the purpose, for the following reasons. First, readers may only be interested in finding the reasons and contexts (which we call causal backgrounds) for a certain part of a tweet. Intuitively, such content would be politically relevant and accord with the public’s recent attention, which is not usually reflected within the context. Moreover, the content should be human-readable, while the noisy and informal nature of tweets hinders regular Open Information Extraction systems. Second, similarity does not capture causality, and the causality between tweet contents and news contents is beyond the scope of existing causality extraction tools. Meanwhile, constructing a high-quality tweet-to-intent dataset is non-trivial. We propose the first end-to-end framework for discovering the causal backgrounds of politicians’ tweets by: (1) designing an Open IE system with rule-free representations for tweets; (2) introducing sources such as Wikipedia linkage and edit history to identify focal contents; and (3) finding implicit causalities between different contexts using explicit causalities learned elsewhere. We curate a comprehensive dataset of interpretations from political journalists for 533 tweets from 5 US politicians. On average, we obtain the correct answers within the top-2 recommendations. We make our dataset and framework code publicly available.

4.
5.
Social networks like Twitter are an effective means for people to express themselves and ask for help in times of crisis. However, to provide help, authorities need to separate informative posts from the vast amount of non-informative ones in order to know what is actually happening. Traditional methods for identifying informative posts emphasize the presence or absence of certain words, which limits their ability to classify such posts. In contrast, in this paper we propose to consider the overall distribution of words in a post. Based on the distributional hypothesis in linguistics, we assume that each tweet is a distribution from which a sample of words has been drawn. Building on recent developments in learning methods, namely learning on distributions, we propose an approach that identifies informative tweets using this distributional assumption. Extensive experiments have been performed on Twitter data from more than 20 crisis incidents covering nearly all incident types. These experiments show the superiority of the proposed approach on a number of real crisis incidents. This implies that better modelling of tweet content, based on recent advances in estimating distributions and on domain-specific knowledge for various types of crisis incidents such as floods or earthquakes, may help achieve higher accuracy on this task.
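A heavily simplified sketch of the "tweet as a distribution of words" idea follows: each tweet is treated as a sample of word vectors, summarized here by its mean embedding (a crude distribution statistic) and classified with an SVM. The embeddings, texts, and labels are toy placeholders, not the paper's learning-on-distributions machinery.

```python
# A simplified sketch: represent each tweet by the mean of its word vectors
# and train an informative/non-informative classifier. All data is synthetic.
import numpy as np
from sklearn.svm import SVC

emb = {  # hypothetical 3-d word embeddings
    "flood": [0.9, 0.1, 0.0], "rescue": [0.8, 0.2, 0.1],
    "bridge": [0.7, 0.3, 0.2], "lol": [0.0, 0.9, 0.8],
    "movie": [0.1, 0.8, 0.9], "pizza": [0.0, 0.7, 0.9],
}

def tweet_vec(text):
    """Mean word vector of a tweet, skipping out-of-vocabulary words."""
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0)

X = np.array([tweet_vec(t) for t in
              ["flood rescue bridge", "bridge flood",
               "lol movie", "pizza movie lol"]])
y = [1, 1, 0, 0]  # 1 = informative, 0 = not informative
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([tweet_vec("rescue flood")]))  # expect [1]
```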

6.
One main challenge of Named Entity Recognition (NER) for tweets is the insufficient information in a single tweet, owing to tweets' noisy and short nature. We propose a novel system that tackles this challenge by leveraging redundancy in tweets, conducting two-stage NER over multiple similar tweets. Specifically, it first pre-labels each tweet using a sequential labeler based on a linear Conditional Random Fields (CRF) model. It then clusters tweets so that tweets with similar content fall into the same group. Finally, for each cluster it refines the labels of each tweet using an enhanced CRF model that incorporates cluster-level information, i.e., the labels of the current word and its neighboring words across all tweets in the cluster. We evaluate our method on a manually annotated dataset and show that it boosts the F1 score of the baseline without collective labeling from 75.4% to 82.5%.
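A minimal sketch of the first-stage CRF pre-labeling is given below using the sklearn_crfsuite library; the second-stage, cluster-enhanced CRF is only hinted at by the hypothetical `cluster_label` feature. The token features, training data, and hyperparameters are illustrative, not the paper's configuration.

```python
# First-stage CRF pre-labeling sketch. Stage two would re-run a CRF whose
# features include labels of similar tweets in the same cluster, represented
# here only by the optional `cluster_label` stand-in feature.
import sklearn_crfsuite

def word_feats(sent, i, cluster_label=None):
    w = sent[i]
    f = {"lower": w.lower(), "is_title": w.istitle(), "is_upper": w.isupper()}
    if cluster_label:                   # stage two: cluster-level evidence
        f["cluster_label"] = cluster_label
    return f

train_sents = [["Obama", "visits", "Berlin"], ["rain", "in", "Berlin"]]
train_labels = [["B-PER", "O", "B-LOC"], ["O", "O", "B-LOC"]]

X = [[word_feats(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, train_labels)

test = ["Obama", "in", "Berlin"]
print(crf.predict([[word_feats(test, i) for i in range(len(test))]]))
```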

7.
Journalists, emergency responders, and the general public use Twitter during disasters as an effective means of disseminating emergency information. However, there is growing concern about the credibility of disaster tweets. This concern negatively influences Twitter users' decisions about whether to retweet information, which can delay the dissemination of accurate, and sometimes essential, communications during a crisis. Although verifying information credibility is often a time-consuming task requiring considerable cognitive effort, researchers have yet to explore how people manage this task while using Twitter during disaster situations. To address this, we adopt the Heuristic-Systematic Model of information processing to understand how Twitter users make retweet decisions, categorizing tweet content as systematically processed information and a Twitter user's profile as heuristically processed information. We then empirically examine tweet content and Twitter user profiles, as well as how they interact, to verify the credibility of tweets collected during two disaster events: the 2011 Queensland floods and the 2013 Colorado floods. Our empirical results suggest that using a Twitter profile as source-credibility information makes it easier for Twitter users to assess the credibility of disaster tweets. Our study also reveals that the Twitter user profile is a reliable source of credibility information, enhancing our understanding of timely communication on Twitter during disasters.

8.
User location data is valuable for diverse social media analytics. In this paper, we address the non-trivial task of estimating worldwide city-level Twitter user locations from historical tweets alone. We propose a purely unsupervised approach based on synthetic geographic sampling of Google Trends (GT) city-level frequencies of tweet nouns, combined with three clustering algorithms. The approach was validated empirically on a recently collected dataset with 3,268 worldwide city-level locations of Twitter users, obtaining competitive results compared with a state-of-the-art Word Distribution (WD) user location estimation method. The best overall results were achieved by the GT noun DBSCAN (GTN-DB) method, which is computationally fast and correctly predicts the ground-truth locations of 15%, 23%, 39% and 58% of the users for tolerance distances of 250 km, 500 km, 1,000 km and 2,000 km, respectively.
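A hedged sketch of the core GTN-DB step follows: cluster synthetic geographic samples of a user's tweet nouns with DBSCAN under a haversine metric and take the densest cluster's centroid as the predicted city. The sample points, eps radius, and min_samples value are illustrative assumptions rather than the paper's actual GT-derived samples.

```python
# DBSCAN over synthetic geographic samples for one user, haversine metric.
# scikit-learn's haversine metric expects coordinates in radians; eps in
# radians is obtained by dividing a kilometre radius by Earth's radius.
import numpy as np
from sklearn.cluster import DBSCAN

samples = np.array([                       # (lat, lon) in degrees
    [51.51, -0.13], [51.50, -0.12], [51.52, -0.14],  # London-ish points
    [40.71, -74.01],                                  # one stray NYC point
])
rad = np.radians(samples)
eps_km = 50
db = DBSCAN(eps=eps_km / 6371.0, min_samples=2, metric="haversine").fit(rad)

labels = db.labels_                        # -1 marks noise
best = max(set(labels) - {-1}, key=list(labels).count)
centroid = samples[labels == best].mean(axis=0)
print(f"predicted location: {centroid}")   # near London
```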

9.
Unstructured tweet feeds are becoming a source of real-time information for various events. However, extracting actionable information from this unstructured text data in real time is a challenging task, so researchers are employing word embedding approaches to classify such data. We set our study in the context of the 2014 Ebola and 2016 Zika outbreaks and probed the accuracy of domain-specific word vectors for identifying crisis-related actionable tweets. Our findings suggest that relatively small domain-specific input corpora drawn from Twitter are better at capturing meaningful semantic relationships than generic pre-trained Word2Vec (trained on Google News) or GloVe (from the Stanford NLP group) embeddings. However, high-quality domain-specific tweet corpora are normally scant during the early stages of an outbreak, and identifying actionable tweets at that stage is crucial to stemming an outbreak's proliferation. To overcome this challenge, we consider scholarly abstracts related to the Ebola and Zika viruses from PubMed and probe the efficiency of cross-domain resource utilization for word vector generation. Our findings demonstrate the relevance of PubMed abstracts for training when Twitter data (as an input corpus) are scant during the early stages of an outbreak; this approach can thus be applied to handle future outbreaks in real time. We also explore the accuracy of our word vectors under various model architectures and hyper-parameter settings, observing that Skip-gram yields better accuracy than CBOW and that higher dimensions yield better accuracy.
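The Skip-gram versus CBOW contrast can be reproduced in miniature with gensim, as in the sketch below; the toy corpus is a stand-in for outbreak-related tweets or PubMed abstracts, and the hyperparameter values are illustrative.

```python
# Train Skip-gram (sg=1) and CBOW (sg=0) embeddings on a tiny toy corpus
# and compare nearest neighbours of a domain term under each architecture.
from gensim.models import Word2Vec

corpus = [
    ["ebola", "outbreak", "reported", "in", "guinea"],
    ["zika", "virus", "spreads", "through", "mosquitoes"],
    ["ebola", "virus", "transmission", "via", "contact"],
]

skipgram = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1)
cbow = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0)

print(skipgram.wv.most_similar("ebola", topn=2))
print(cbow.wv.most_similar("ebola", topn=2))
```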

10.
The emergence of social media and the huge number of opinions posted every day have influenced online reputation management. Reputation experts need to filter and monitor what is posted online and, more importantly, determine whether an online post will have positive or negative implications for the entity of interest. This task is challenging, considering that some posts have implications for an entity's reputation yet express no sentiment. In this paper, we propose two approaches for propagating sentiment signals to estimate the reputation polarity of tweets. The first approach is based on sentiment lexicon augmentation, whereas the second is based on direct propagation of sentiment signals to tweets that discuss the same topic. In addition, we present a polar fact filter that differentiates between reputation-bearing and reputation-neutral tweets. Our experiments indicate that weakly supervised annotation of reputation polarity is feasible and that sentiment signals can be propagated to effectively estimate the reputation polarity of tweets. Finally, we show that learning PMI values from the training data is the most effective approach for reputation polarity analysis.
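To illustrate the PMI idea, the sketch below estimates pointwise mutual information between terms and polarity labels from a handful of toy labeled tweets; the counts and probability estimates are illustrative, not the paper's exact formulation.

```python
# PMI between a term and a reputation-polarity label, estimated from toy
# training tweets as log P(label | term) / P(label).
import math
from collections import Counter

train = [("ceo resigns amid scandal", "neg"), ("profits scandal deepens", "neg"),
         ("record profits announced", "pos"), ("award for innovation", "pos")]

term_label, term_cnt, label_cnt = Counter(), Counter(), Counter()
for text, label in train:
    label_cnt[label] += 1
    for term in set(text.split()):
        term_label[(term, label)] += 1
        term_cnt[term] += 1

def pmi(term, label):
    """PMI as log P(label | term) / P(label); -inf if never co-observed."""
    joint = term_label[(term, label)]
    if joint == 0:
        return float("-inf")
    p_label_given_term = joint / term_cnt[term]
    p_label = label_cnt[label] / sum(label_cnt.values())
    return math.log(p_label_given_term / p_label)

print(pmi("scandal", "neg"), pmi("profits", "neg"))  # polar term vs. mixed term
```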

11.
Recently, geolocalisation of tweets has become important for a wide range of real-time applications, including real-time event detection, topic detection, and disaster and emergency analysis. However, the number of relevant geotagged tweets available for such tasks remains insufficient. To overcome this limitation, predicting the location of non-geotagged tweets, while challenging, can enlarge the sample of geotagged data and benefits a wide range of applications. In this paper, we propose a location inference method that combines a ranking approach with majority voting over tweets, where each vote is weighted by evidence gathered from the ranking. Using geotagged tweets from two US cities, Chicago and New York, our experimental results demonstrate that our method statistically significantly outperforms state-of-the-art baselines in terms of accuracy and error distance in both cities, at the cost of decreased coverage. Finally, we investigated the applicability of our method in a real-time scenario by means of a traffic incident detection task. Our analysis shows that our fine-grained geolocalisation method can overcome the limitations of geotagged tweets and precisely map incident-related tweets to the real location of the incident.
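A simplified sketch of evidence-weighted majority voting follows: the top-ranked geotagged tweets most similar to a query tweet each vote for their own area, weighted by retrieval score, and low-confidence predictions can be withheld. The area names, scores, and threshold idea are illustrative assumptions.

```python
# Weighted majority voting over a ranked list of similar geotagged tweets.
from collections import defaultdict

# (fine-grained area of a similar geotagged tweet, retrieval score)
ranked = [("loop_chicago", 4.2), ("loop_chicago", 3.9),
          ("wicker_park", 3.1), ("loop_chicago", 1.2), ("hyde_park", 0.8)]

votes = defaultdict(float)
for area, score in ranked[:5]:          # vote over the top-k of the ranking
    votes[area] += score                # each vote weighted by its evidence

best = max(votes, key=votes.get)
confidence = votes[best] / sum(votes.values())
print(best, round(confidence, 2))       # predictions below a confidence
                                        # threshold can be withheld, trading
                                        # coverage for accuracy
```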

12.
As COVID-19 swept over the world, people discussed facts, expressed opinions, and shared sentiments about the pandemic on social media. Since policies such as travel restrictions and lockdowns in reaction to COVID-19 were made at different levels of society (e.g., schools, employers, and government), we build a large geo-tagged Twitter dataset titled UsaGeoCov19 and perform an exploratory analysis by geographic location. Specifically, we collect 650,563 unique geo-tagged tweets across the United States covering January 25 to May 10, 2020. Tweet locations enable us to conduct region-specific studies, such as of tweeting volumes and sentiment, sometimes in response to local regulations and reported COVID-19 cases. During this period, many people started working from home, and the gap between workdays and weekends in hourly tweet volumes inspires us to propose algorithms for estimating work engagement during the COVID-19 crisis. This paper also summarizes the themes and topics of tweets in our dataset using both social-media-specific markers (i.e., #hashtags and @mentions) and the latent Dirichlet allocation model. We welcome requests for data sharing and conversations for more insights. UsaGeoCov19 link: http://yunhefeng.me/geo-tagged_twitter_datasets/.
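One plausible way to quantify the workday/weekend gap as a work-engagement signal is sketched below; the paper's actual algorithm may differ, and the `df` data frame, the 9-to-17 work-hour window, and the gap definition are all assumptions for illustration.

```python
# Toy engagement proxy: share of tweets posted in work hours on weekdays
# minus the same share on weekends. `df` holds one row per tweet.
import pandas as pd

df = pd.DataFrame({"created_at": pd.to_datetime([
    "2020-03-02 10:00", "2020-03-02 14:00", "2020-03-03 11:00",  # weekdays
    "2020-03-07 13:00", "2020-03-08 20:00",                      # weekend
])})

df["hour"] = df.created_at.dt.hour
df["weekend"] = df.created_at.dt.dayofweek >= 5
work_hours = df.hour.between(9, 17)      # assumed 9:00-17:00 work window

weekday_rate = (work_hours & ~df.weekend).sum() / max((~df.weekend).sum(), 1)
weekend_rate = (work_hours & df.weekend).sum() / max(df.weekend.sum(), 1)
engagement_gap = weekday_rate - weekend_rate  # larger gap ~ more work activity
print(round(engagement_gap, 2))
```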

13.
Users’ ability to retweet information has made Twitter one of the most prominent social media platforms for disseminating emergency information during disasters. However, few studies have examined how Twitter's features can support the different communication patterns that occur during different phases of disaster events. Based on the disaster communication literature and Media Synchronicity Theory, we identify distinct disaster phases and the two communication types, crisis communication and risk communication, that occur during those phases. We investigate how Twitter's representational features, including words, URLs, hashtags, and hashtag importance, influence the average retweet time, that is, the average time it takes for a retweet to occur, and how such effects differ depending on the type of disaster communication. Our analysis of tweets from the 2013 Colorado floods found that adding more URLs to tweets increases the average retweet time more in risk-related tweets than in crisis-related ones. Further, including key disaster-related hashtags contributed to faster retweets for crisis-related tweets than for risk-related tweets. Our findings suggest that the influence of Twitter's media capabilities on rapid tweet propagation during disasters may differ according to the communication process.
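The kind of moderation effect described above can be tested with an interaction term in a regression, as in the hedged sketch below; the data frame, variable names, and model form are toy assumptions, not the study's actual analysis.

```python
# OLS with an interaction term: does the effect of URL count on average
# retweet time differ between risk-related (is_risk=1) and crisis-related
# (is_risk=0) tweets? The n_urls:is_risk coefficient captures the difference.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "avg_retweet_time": [12.0, 30.5, 8.2, 45.1, 10.3, 28.7, 9.9, 40.2],
    "n_urls":           [0,    2,    0,   3,    1,    2,    0,   3],
    "is_risk":          [0,    1,    0,   1,    0,    1,    0,   1],
})

model = smf.ols("avg_retweet_time ~ n_urls * is_risk", data=df).fit()
print(model.params)
```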

14.
Information filtering has long been a major research task in information retrieval (IR), traditionally focusing on filtering well-formed documents such as news articles. Recently, more interest has been directed toward applying filtering to user-generated content such as microblogs. Several earlier studies investigated microblog filtering for focused topics. Another vital filtering scenario in microblogs targets the detection of posts relevant to long-standing broad and dynamic topics, i.e., topics spanning several subtopics that change over time. This type of filtering is essential for many applications, such as social studies of large events and news tracking of temporal topics. In this paper, we introduce an adaptive microblog filtering task that focuses on tracking topics of a broad and dynamic nature. We propose an entirely unsupervised approach that adapts to new aspects of a topic in order to retrieve relevant microblogs. We evaluated our filtering approach on 6 broad topics, each tested over 4 different time periods spanning 4 months. Experimental results showed that, on average, our approach achieved an 84% increase in recall relative to the baseline approach while maintaining acceptable precision, which dropped by about 8%. Our filtering method is currently deployed on TweetMogaz, a news portal generated from tweets. The website processes the stream of Arabic tweets, detects tweets relevant to different regions in the Middle East, and presents them as comprehensive reports of the top stories and news in each region.
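One simple form of unsupervised topic adaptation is sketched below: periodically promote the most frequent novel terms from recently matched tweets into the tracking query so new subtopics are picked up. The stream, query terms, and frequency threshold are illustrative assumptions, not the paper's exact mechanism.

```python
# Unsupervised adaptive filtering sketch: expand the tracking query with
# frequent co-occurring terms from tweets matched in the last time window.
from collections import Counter

query = {"election", "parliament"}
stream = [
    "election results delayed in parliament vote",
    "parliament debates new coalition deal",
    "coalition deal sparks protests after election",
]

matched = [t for t in stream if query & set(t.split())]
term_freq = Counter(w for t in matched for w in t.split()
                    if w not in query and len(w) > 3)

# Adapt: promote frequent novel terms into the query for the next window.
query |= {w for w, c in term_freq.items() if c >= 2}
print(sorted(query))   # now also tracks "coalition" and "deal"
```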

15.
In reputation management, knowing what impact a tweet has on the reputation of a brand or company is crucial. The reputation polarity of a tweet is a measure of how the tweet influences that reputation. We consider the task of automatically determining the reputation polarity of a tweet. For this classification task, we propose a feature-based model built on three dimensions: the source of the tweet, the contents of the tweet, and the reception of the tweet, i.e., how the tweet is perceived. For evaluation purposes, we use the RepLab 2012 and 2013 datasets. We study and contrast three training scenarios. The first is independent of the entity whose reputation is being managed; the second depends on the entity at stake but has, on average, over 90% fewer training samples per model; the third depends on the domain of the entities. We find that reputation polarity differs from sentiment and that having less, but entity-dependent, training data is significantly more effective for predicting the reputation polarity of a tweet than an entity-independent training scenario. Features related to the reception of a tweet perform significantly better than most other features.

16.
The paper presents new annotated corpora for stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for stance recognition that takes into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short-text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674), or neither (1,223), with a Fleiss' kappa of 0.725. Results show that self-training with an SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. A combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora, with topic quality measured in terms of trustworthiness and a validation index.
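A minimal sketch of a self-training setup with an SVM base estimator follows, using scikit-learn's SelfTrainingClassifier as one possible realization (the paper's exact pipeline and features are unknown); the Spanish texts, labels, and confidence threshold are toy assumptions.

```python
# Self-training sketch: an SVM with probability outputs pseudo-labels the
# unlabelled tweets (marked -1) whose predicted probability clears a threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

texts = ["vacuna segura y eficaz", "no confío en la vacuna",
         "me vacuno mañana", "la vacuna es un engaño",
         "hoy hace sol", "la vacuna llega pronto"]
# 1 = in favour, 0 = against, -1 = unlabelled (to be pseudo-labelled)
y = np.array([1, 0, 1, 0, -1, -1])

X = TfidfVectorizer().fit_transform(texts)
self_training = SelfTrainingClassifier(SVC(probability=True), threshold=0.6)
self_training.fit(X, y)
print(self_training.predict(X[-2:]))
```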

17.
The increased use of bots on the Internet in general, and on social networks in particular, has many implications for influencing public opinion. Mechanisms for distinguishing humans from machines span a broad spectrum of applications and hence vary in nature and complexity. Here we use several public Twitter datasets to build a model that predicts whether an account is a bot, based on features extracted at the tweet or the account level. We then apply the model to Twitter's Russian Troll Tweets dataset. At the account level, we evaluate features related to how often accounts tweet, as previous research has shown that some bot accounts are extremely active while others are barely active at all. At the tweet level, we noticed that bot accounts tend to sound more formal or structured, whereas real user accounts tend to be more informal, containing more slang, slurs, cursing, and the like. We also note that bots can be created for a range of different goals (e.g., marketing and politics) and that their behaviors vary with those goals. Ultimately, for high bot-prediction accuracy, models should consider and distinguish among the different goals for which bots are created.
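The two feature levels described above can be sketched as follows: an account-level activity rate plus a tweet-level informality score, fed to a random forest. The slang list, feature definitions, and training examples are illustrative assumptions, not the paper's feature set.

```python
# Bot-detection sketch combining account-level (tweet rate) and tweet-level
# (informality) features; all data below is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SLANG = {"lol", "omg", "wtf", "lmao", "gonna", "wanna"}

def features(tweets, account_age_days):
    rate = len(tweets) / max(account_age_days, 1)          # tweets per day
    tokens = [w for t in tweets for w in t.lower().split()]
    informality = sum(w in SLANG for w in tokens) / max(len(tokens), 1)
    return [rate, informality]

X = np.array([
    features(["Breaking: policy update issued."] * 120, 10),    # bot-like
    features(["lol gonna watch the game", "omg traffic"], 900),  # human-like
])
y = [1, 0]  # 1 = bot, 0 = human
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([features(["Official statement released."] * 80, 7)]))
```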

18.
Satisfying non-trivial information needs involves collecting information from multiple sources and synthesizing an answer that organizes that information. Traditional recall/precision-oriented information retrieval focuses on just one phase of that process: how to efficiently and effectively identify documents likely to be relevant to a specific, focused query. The TREC Interactive Track has as its goal the location of documents that pertain to different instances of a query topic, with no reward for duplicated coverage of topic instances. This task is similar to that of organizing answer components into a complete answer. Clustering and classification are two mechanisms for organizing documents into groups. In this paper, we present an ongoing series of experiments that test the feasibility and effectiveness of using clustering and classification as aids to instance retrieval and, ultimately, answer construction. Our results show that users prefer such structured presentations of the candidate result set to a list-based approach. Assessment of the structured organizations based on the subjective judgements of the experimental subjects suggests that structured organization can be more effective; however, assessment based on objective judgements shows mixed results. These results indicate that fully determining the success of the approach depends on assessing the quality of the final answers generated by users, rather than on performance during the intermediate stages of answer construction.
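As a small illustration of organizing a retrieved result set by clustering rather than presenting a flat ranked list, the sketch below groups TF-IDF document vectors with k-means; the documents and cluster count are toy assumptions, not the paper's experimental setup.

```python
# Cluster a retrieved result set so each group of related documents can be
# presented together instead of as one long ranked list.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["flood damages downtown bridge", "bridge closed after flood",
        "city council budget vote", "budget vote delayed in council"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for c in range(2):                      # present each cluster as one group
    group = [d for d, label in zip(docs, km.labels_) if label == c]
    print(f"cluster {c}: {group}")
```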

19.
The rising popularity of social media posts, most notably Twitter posts, as a data source for social science research poses significant problems with regard to access to representative, high-quality data. Cheap, publicly available data such as that obtained from Twitter's public application programming interfaces is often of low quality, while high-quality data is expensive both financially and computationally. Moreover, data is often available only in real time, making post-hoc analysis difficult or impossible. We propose and test a methodology for inexpensively creating an archive of Twitter data through population sampling, yielding a database that is highly representative of the targeted user population (in this test case, the entire population of Japanese-language Twitter users). Comparing the tweet volume, keywords, and topics found in our sample with the ground truth of Twitter's full data feed confirmed a very high degree of representativeness. We conclude that this approach yields a dataset suitable for a wide range of post-hoc analyses while remaining cost-effective and accessible to a wide range of researchers.
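One way the representativeness check could look in code is sketched below: compare the keyword distribution of the sampled archive against the full feed with cosine similarity. The keyword counts are invented placeholders, and the similarity measure is an assumption; the paper compares volumes, keywords, and topics by its own procedure.

```python
# Compare a sampled archive's keyword distribution against the full feed.
import numpy as np

keywords = ["地震", "選挙", "野球", "映画"]          # illustrative Japanese terms
full_feed_counts = np.array([1200.0, 800.0, 1500.0, 900.0])
sample_counts = np.array([118.0, 84.0, 151.0, 88.0])

p = full_feed_counts / full_feed_counts.sum()
q = sample_counts / sample_counts.sum()
cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
print(f"keyword-distribution similarity: {cosine:.4f}")  # near 1.0 = representative
```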

20.
Coronavirus-related discussions have spiraled at an exponential rate since the initial outbreak; by the end of May 2020, more than 6 million people had been diagnosed with the infection. Twitter witnessed an outpouring of anxious tweets about the spread of the virus. Government and health officials responded to these troubling tweets, reassuring the public with regular alerts on the virus's progress and information on how to defend against it. We observe that social media users are worried about the COVID-19 crisis, and we identify three separate conversations: on contagion, prevention, and the economy. We classify the tone of officials' tweets as alarming or reassuring and capture the response of Twitter users to official communications. Such studies can provide insights to health officials and government agencies for crisis management, specifically regarding communicating emergency information to the public via social media to establish reassurance.
