Similar Articles
1.
2.
Information filtering has long been a major task of study in the field of information retrieval (IR), traditionally focusing on well-formed documents such as news articles. Recently, more interest has been directed towards applying filtering to user-generated content such as microblogs. Several earlier studies investigated microblog filtering for focused topics. Another vital filtering scenario in microblogs targets the detection of posts relevant to long-standing broad and dynamic topics, i.e., topics spanning several subtopics that change over time. This type of filtering is essential for many applications, such as social studies on large events and news tracking of temporal topics. In this paper, we introduce an adaptive microblog filtering task that focuses on tracking topics of a broad and dynamic nature. We propose an entirely unsupervised approach that adapts to new aspects of the topic to retrieve relevant microblogs. We evaluated our filtering approach using 6 broad topics, each tested on 4 different time periods over 4 months. Experimental results showed that, on average, our approach achieved an 84% increase in recall relative to the baseline approach, while maintaining acceptable precision that dropped by only about 8%. Our filtering method is currently deployed on TweetMogaz, a news portal generated from tweets. The website processes the stream of Arabic tweets, detects tweets relevant to different regions in the Middle East, and presents them in comprehensive reports that include the top stories and news in each region.
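A minimal sketch of how such an adaptive, unsupervised filter could work: posts are scored against a weighted topic-term profile, and terms from accepted posts feed back into the profile so new aspects of the topic are picked up. The tokenizer, weights, and thresholds below are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}

def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would handle Arabic text."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]

class AdaptiveTopicFilter:
    """Tracks a broad topic by expanding its term profile from accepted posts."""

    def __init__(self, seed_terms, accept_threshold=2, top_k=50):
        self.profile = Counter({t: 1.0 for t in seed_terms})
        self.accept_threshold = accept_threshold  # min weighted term overlap
        self.top_k = top_k                        # profile size cap

    def score(self, post):
        return sum(self.profile.get(t, 0.0) for t in tokenize(post))

    def filter(self, post):
        """Return True if the post is judged relevant; adapt the profile if so."""
        if self.score(post) >= self.accept_threshold:
            for term in tokenize(post):
                self.profile[term] += 0.1  # drift toward emerging subtopics
            # keep only the strongest terms so stale aspects can fade out
            self.profile = Counter(dict(self.profile.most_common(self.top_k)))
            return True
        return False

f = AdaptiveTopicFilter(seed_terms=["egypt", "cairo", "protest"])
print([p for p in ["protest in cairo today", "cute cat video"] if f.filter(p)])
```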

3.
Stock prediction via market data analysis is an attractive research topic. Both stock prices and news articles have been employed in the prediction process. However, how to combine technical indicators from stock prices with news sentiments from textual news articles, and how to make the prediction model learn the sequential information within the time series in an intelligent way, remain open problems. In this paper, we build a stock prediction system and propose an approach that 1) represents numerical price data by technical indicators via technical analysis and represents textual news articles by sentiment vectors via sentiment analysis, 2) sets up a layered deep learning model to learn the sequential information within the market snapshot series, which is constructed from the technical indicators and news sentiments, and 3) sets up a fully connected neural network to make stock predictions. Experiments have been conducted on more than five years of Hong Kong Stock Exchange data using four different sentiment dictionaries, and the results show that 1) the proposed approach outperforms the baselines on both validation and test sets under two different evaluation metrics, 2) models incorporating both prices and news sentiments outperform models that use only technical indicators or only news sentiments, at both the individual stock level and the sector level, and 3) among the four sentiment dictionaries, the finance domain-specific dictionary (the Loughran–McDonald Financial Dictionary) models the news sentiments best and brings larger prediction performance improvements than the other three dictionaries.
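A sketch of the feature-construction step such a system needs: technical indicators computed from the price series and a dictionary-based sentiment vector from the news, concatenated into one market snapshot. The simple indicators and the abbreviated word lists standing in for the Loughran–McDonald dictionary are illustrative assumptions.

```python
import numpy as np

# Tiny stand-ins for the Loughran-McDonald word lists (illustrative only).
LM_POSITIVE = {"gain", "growth", "profit", "strong"}
LM_NEGATIVE = {"loss", "decline", "risk", "weak"}

def sentiment_vector(article):
    """2-d sentiment vector: fractions of positive and negative dictionary hits."""
    tokens = article.lower().split()
    n = max(len(tokens), 1)
    pos = sum(t in LM_POSITIVE for t in tokens)
    neg = sum(t in LM_NEGATIVE for t in tokens)
    return np.array([pos / n, neg / n])

def technical_indicators(closes, window=5):
    """Simple indicators from a closing-price series: SMA ratio and RSI."""
    closes = np.asarray(closes, dtype=float)
    sma = closes[-window:].mean()
    deltas = np.diff(closes[-(window + 1):])
    gains, losses = deltas[deltas > 0].sum(), -deltas[deltas < 0].sum()
    rsi = 100.0 if losses == 0 else 100 - 100 / (1 + gains / losses)
    return np.array([closes[-1] / sma, rsi / 100.0])

def market_snapshot(closes, articles):
    """One time step of the snapshot series fed to the sequential model."""
    news = [sentiment_vector(a) for a in articles]
    avg_sent = np.mean(news, axis=0) if news else np.zeros(2)
    return np.concatenate([technical_indicators(closes), avg_sent])

print(market_snapshot([10, 10.2, 10.1, 10.4, 10.6, 10.5],
                      ["strong profit growth", "supply risk remains"]))
```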

4.
Topic models are widely used for discovering thematic structure in text. However, traditional topic models often require dedicated inference procedures for the specific task at hand, and they are not designed to generate word-level semantic representations. To address these limitations, we propose a neural topic modeling approach based on Generative Adversarial Nets (GANs), called the Adversarial-neural Topic Model (ATM). To the best of our knowledge, this work is the first attempt to use adversarial training for topic modeling. The proposed ATM models topics with a Dirichlet prior and employs a generator network to capture the semantic patterns among latent topics. Meanwhile, the generator can also produce word-level semantic representations. Furthermore, to illustrate the feasibility of porting ATM to tasks other than topic modeling, we apply ATM to open-domain event extraction. To validate the effectiveness of the proposed ATM, two topic modeling benchmark corpora and an event dataset are employed in the experiments. Our experimental results on the benchmark corpora show that ATM generates more coherent topics (under five topic coherence measures), outperforming a number of competitive baselines. Moreover, the experiments on the event dataset also validate that the proposed approach is able to extract meaningful events from news articles.
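A minimal PyTorch sketch of the adversarial setup described above: topic proportions drawn from a Dirichlet prior are mapped by a generator to word distributions, which a discriminator scores against real document-word distributions. Dimensions, layer sizes, and the sampling scheme are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

V, K = 2000, 20  # vocabulary size and number of topics (illustrative)

class Generator(nn.Module):
    """Maps Dirichlet-distributed topic proportions to a word distribution."""
    def __init__(self, n_topics=K, vocab=V):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_topics, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, vocab), nn.Softmax(dim=-1))

    def forward(self, theta):
        return self.net(theta)

class Discriminator(nn.Module):
    """Scores whether a document-word distribution looks real."""
    def __init__(self, vocab=V):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x)

gen, disc = Generator(), Discriminator()
prior = torch.distributions.Dirichlet(torch.full((K,), 0.1))
theta = prior.sample((8,))          # topic proportions for a fake batch
fake_docs = gen(theta)              # generated word distributions
print(disc(fake_docs).shape)        # torch.Size([8, 1])
```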

5.
Propaganda is a mechanism for influencing public opinion, and it is inherently present in extremely biased and fake news. Here, we propose a model to automatically assess the level of propagandistic content in an article based on different representations, from writing style and readability level to the presence of certain keywords. We experiment thoroughly with different variations of such a model on a new, publicly available corpus, and we show that character n-grams and other style features outperform existing alternatives that identify propaganda based on word n-grams. Unlike previous work, we ensure that the test data comes from news sources that were unseen during training, thus penalizing learning algorithms that model the news sources used at training time rather than solving the actual task. We integrate our supervised model into a public website, which organizes recent articles covering the same event on the basis of their propagandistic content. This allows users to quickly explore different perspectives on the same story, and it also enables investigative journalists to dig further into how different media use stories and propaganda to pursue their agendas.
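A hedged sketch of the style-based classifier with a source-disjoint split, using scikit-learn. The placeholder articles, labels, and outlet names are invented, and the character n-gram range is one plausible configuration rather than the paper's exact one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline

# Placeholder data: (article text, 1 = propagandistic, source name).
texts = [
    "Our glorious leader crushed the corrupt traitors once again.",
    "The committee published its quarterly budget review on Monday.",
    "Enemies of the people spread vile lies about our heroic army.",
    "Local schools will reopen next week after routine maintenance.",
    "They will stop at nothing to destroy everything we hold sacred.",
    "The council approved a new bicycle lane along the river road.",
]
labels = [1, 0, 1, 0, 1, 0]
sources = ["outletA", "outletA", "outletB", "outletB", "outletC", "outletC"]

# Split so that no news source seen in training appears in the test set,
# mirroring the paper's evaluation protocol (test_size=1 holds out one source).
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(texts, labels, groups=sources))

# Character n-grams capture writing style better than word n-grams here.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(max_iter=1000))
model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
print(model.score([texts[i] for i in test_idx], [labels[i] for i in test_idx]))
```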

6.
With the information explosion of news articles, personalized news recommendation has become important for helping users quickly find news they are interested in. Existing news recommendation methods mainly include collaborative filtering methods, which rely on direct user-item interactions, and content-based methods, which characterize the content of a user's reading history. Although these methods have achieved good performance, they still suffer from the data sparsity problem, since most of them fail to extensively exploit the high-order structure information (similar users tend to read similar news articles) in news recommendation systems. In this paper, we propose to build a heterogeneous graph that explicitly models the interactions among users, news, and latent topics. The incorporated topic information helps indicate a user's interest and alleviates the sparsity of user-item interactions. We then take advantage of graph neural networks to learn user and news representations that encode high-order structure information by propagating embeddings over the graph. The user embeddings learned from complete historic user clicks capture the users' long-term interests. We also model a user's short-term interest from the recent reading history with an attention-based LSTM model. Experimental results on real-world datasets show that our proposed model significantly outperforms state-of-the-art methods for news recommendation.
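A toy illustration of the embedding-propagation idea in plain NumPy: one mean-aggregation round over a user-news click graph, repeated so that 2-hop ("similar users read similar news") structure reaches the embeddings. The real model uses learnable weights, topic nodes, and attention, none of which appear in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_news, dim = 3, 4, 8

# Bipartite click matrix: A[u, i] = 1 if user u read news i.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

user_emb = rng.normal(size=(n_users, dim))
news_emb = rng.normal(size=(n_news, dim))

def propagate(user_emb, news_emb, A):
    """One message-passing round: each node averages its neighbours'
    embeddings and mixes the result with its own representation."""
    deg_u = A.sum(axis=1, keepdims=True).clip(min=1)
    deg_i = A.sum(axis=0, keepdims=True).clip(min=1).T
    new_users = 0.5 * user_emb + 0.5 * (A @ news_emb) / deg_u
    new_news = 0.5 * news_emb + 0.5 * (A.T @ user_emb) / deg_i
    return new_users, new_news

# Two rounds let embeddings absorb 2-hop (similar-user) structure.
for _ in range(2):
    user_emb, news_emb = propagate(user_emb, news_emb, A)

scores = user_emb @ news_emb.T          # recommendation scores
print(scores.argmax(axis=1))            # top news index per user
```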

7.
In this paper, we propose a new language model, namely a dependency structure language model, for topic detection and tracking (TDT), to compensate for the weaknesses of unigram and bigram language models. The dependency structure language model is based on the Chow expansion theory and the dependency parse tree generated by a linguistic parser, so long-distance dependencies are naturally captured. We carried out extensive experiments to verify the proposed model on topic tracking and link detection in TDT. In both cases, the dependency structure language model performs better than strong baseline approaches.
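As a rough, simplified analogue (not the paper's Chow-expansion formulation), one can score text by the relative frequencies of head-modifier pairs from a dependency parse, so that long-distance dependencies contribute directly to the score. The sketch assumes spaCy with its en_core_web_sm model installed and uses add-alpha smoothing.

```python
import math
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def dependency_pairs(text):
    """Extract (head lemma, dependent lemma) pairs from the dependency parse."""
    doc = nlp(text)
    return [(tok.head.lemma_, tok.lemma_) for tok in doc if tok.dep_ != "ROOT"]

class DependencyPairLM:
    """Scores documents by relative frequencies of head-modifier pairs;
    word order and distance within the sentence no longer matter."""

    def __init__(self, training_texts, alpha=1.0):
        self.counts = Counter(p for t in training_texts
                              for p in dependency_pairs(t))
        self.total = sum(self.counts.values())
        self.alpha = alpha  # add-alpha smoothing for unseen pairs

    def log_prob(self, text):
        pairs = dependency_pairs(text)
        denom = self.total + self.alpha * (len(self.counts) + 1)
        return sum(math.log((self.counts[p] + self.alpha) / denom)
                   for p in pairs)

lm = DependencyPairLM(["The president announced new sanctions on Monday."])
print(lm.log_prob("New sanctions were announced by the president."))
```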

8.
A new approach to narrative abstractive summarization (NATSUM) is presented in this paper. NATSUM centers on generating a narrative, chronologically ordered summary about a target entity from several news documents related to the same topic. To achieve this, our system first creates a cross-document timeline, where each time point contains all the event mentions that refer to the same event. This timeline is enriched with all the arguments of the events extracted from the different documents. Second, using natural language generation techniques, one sentence is produced for each event using the arguments involved in it. Specifically, a hybrid surface realization approach is used, based on over-generation and ranking techniques. The evaluation demonstrates that NATSUM performs better than extractive summarization approaches and competitive abstractive baselines, improving the F1-measure by at least 50% when a real scenario is simulated.
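A toy sketch of the two stages: merging event arguments across documents into a timeline, then over-generating candidate sentences per event and keeping the best-ranked one (here, naively, the longest). The coreference rule and the templates are illustrative assumptions, not the paper's realization machinery.

```python
from collections import defaultdict

# Event mentions extracted from several documents about the same topic:
# (normalized date, event predicate, arguments found in that document).
mentions = [
    ("2019-03-01", "acquired", {"agent": "AcmeCorp", "object": "StartupX"}),
    ("2019-03-01", "acquired", {"price": "$2M"}),   # same event, another doc
    ("2019-03-07", "resigned", {"agent": "the CEO"}),
]

def build_timeline(mentions):
    """Cross-document timeline: mentions with the same date and predicate are
    treated as the same event, and their arguments are merged."""
    timeline = defaultdict(dict)
    for date, event, args in mentions:
        timeline[(date, event)].update(args)
    return sorted(timeline.items())

def realize(date, event, args):
    """Over-generate candidate sentences from templates and keep the longest,
    a toy stand-in for the paper's over-generation-and-ranking realization."""
    parts = [f"On {date},", args.get("agent", "someone"), event]
    if "object" in args:
        parts.append(args["object"])
    candidates = [" ".join(parts) + "."]
    if "price" in args:
        candidates.append(" ".join(parts + ["for", args["price"]]) + ".")
    return max(candidates, key=len)

for (date, event), args in build_timeline(mentions):
    print(realize(date, event, args))
```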

9.
In this paper, we present a topic discovery system aimed at revealing the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents related to it and a summary extracted from those documents. The resulting summaries are useful for browsing and selecting topics of interest within the generated hierarchies. Our proposal consists of a new incremental hierarchical clustering algorithm that combines partitional and agglomerative approaches, taking the main benefits of each. Finally, a new summarization method based on Testor Theory is proposed to build the topic summaries. Experimental results on the TDT2 collection demonstrate the system's usefulness and effectiveness not only for topic detection, but also as a classification and summarization tool.

10.
Narratives are composed of stories that provide insight into social processes. To make the analysis of narratives more efficient, natural language processing (NLP) methods have been employed to automatically extract information from textual sources, e.g., newspaper articles. Existing work on automatic narrative extraction, however, has ignored the nested character of narratives. In this work, we argue that a narrative may contain multiple accounts given by different actors, with each individual account providing insight into the beliefs and desires underpinning an actor's actions. We present a pipeline for automatically extracting accounts, consisting of NLP methods for: (1) named entity recognition, (2) event extraction, and (3) attribution extraction. Machine learning models for named entity recognition were trained on a state-of-the-art neural network architecture for sequence labelling. For event extraction, we developed a hybrid approach combining semantic role labelling tools, the FrameNet repository of semantic frames, and a lexicon of event nouns. Attribution extraction, meanwhile, was addressed with the aid of a dependency parser and Levin's verb classes. To facilitate the development and evaluation of these methods, we constructed a new corpus of news articles in which named entities, events and attributions have been manually marked up following a novel annotation scheme that covers over 20 event types relating to socio-economic phenomena. Evaluation results show that, relative to a baseline method underpinned solely by semantic role labelling tools, our event extraction approach improves recall by 12.22–14.20 percentage points (reaching as high as 92.60% on one data set). Meanwhile, the use of Levin's verb classes in attribution extraction obtains optimal performance in terms of F-score, outperforming a baseline method by 7.64–11.96 percentage points. Our proposed approach was applied to news articles focused on industrial regeneration cases, facilitating the generation of accounts of events attributed to specific actors.
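A sketch of the attribution-extraction step in the spirit described above, using a dependency parser: the nominal subject of a communication verb is taken as the source and its clausal complement as the attributed content. It assumes spaCy with en_core_web_sm installed, and the small verb list merely gestures at Levin's verb classes.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed installed

# A small sample of communication verbs in the spirit of Levin's "say" classes.
SPEECH_VERBS = {"say", "claim", "announce", "argue", "warn", "deny"}

def extract_attributions(text):
    """Return (source, cue verb, content) triples from the dependency parse:
    the nominal subject of a speech verb is the source, and its clausal
    complement (ccomp) is the attributed content."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB" and tok.lemma_ in SPEECH_VERBS:
                subj = [c for c in tok.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
                content = [c for c in tok.children if c.dep_ == "ccomp"]
                if subj and content:
                    span = content[0].subtree
                    triples.append((subj[0].text, tok.lemma_,
                                    " ".join(t.text for t in span)))
    return triples

print(extract_attributions(
    "The minister claimed that the factory closure would create new jobs."))
```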

11.
We propose a method for analyzing science and technology news topics based on the LDA topic model. Using polar scientific expedition news from China, Australia, the UK and the US from 2009 to 2018, we analyze topic evolution in terms of topic type and topic intensity. In the Chinese news, topics such as polar surveying and mapping rose in popularity while the polar glacier expedition topic declined; in the English news, the hot topics were polar glacier expeditions and polar ocean expeditions; the remaining topics stayed relatively stable. The results show that this method can effectively identify science and technology news topics, reveal their evolution trends, and improve the degree of automation of science and technology intelligence analysis in the web environment.
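A sketch of the topic-intensity side of such an analysis using gensim's LdaModel, where yearly intensity is taken as the mean topic probability over that year's documents. The toy corpus and that definition of intensity are assumptions for illustration.

```python
from collections import defaultdict

from gensim import corpora, models

# Toy corpus: (year, tokenized headline); a real study would use full articles.
docs = [
    (2009, ["polar", "glacier", "expedition", "ice"]),
    (2012, ["polar", "mapping", "survey", "station"]),
    (2015, ["ocean", "expedition", "polar", "research"]),
    (2018, ["mapping", "survey", "satellite", "polar"]),
]

dictionary = corpora.Dictionary(tokens for _, tokens in docs)
bows = [(year, dictionary.doc2bow(tokens)) for year, tokens in docs]

lda = models.LdaModel([b for _, b in bows], num_topics=2,
                      id2word=dictionary, random_state=0, passes=10)

# Topic intensity per year: mean topic probability over that year's documents.
intensity = defaultdict(lambda: [0.0] * lda.num_topics)
counts = defaultdict(int)
for year, bow in bows:
    counts[year] += 1
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        intensity[year][topic_id] += prob

for year in sorted(intensity):
    print(year, [round(p / counts[year], 3) for p in intensity[year]])
```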

12.
The sheer volume of online media information makes it difficult for people to fully grasp a topic in a limited amount of time, which easily leads to important information being missed. Topic detection and tracking (TDT) technology arose precisely from this need: it can quickly and accurately extract the content people are interested in from massive collections of information. In recent years, TDT has become a popular research direction in natural language processing. It organizes large amounts of information effectively, mines useful information from it with related techniques, and lets people understand, in a concise and effective way, all the details of an event or phenomenon and the relationships among them. This paper surveys the research background, key concepts, evaluation methods and related techniques of topic tracking, and summarizes the current state of the art.

13.
With newspapers' recent move to online reporting, traditional norms and practices of news reporting have changed to accommodate the new realities of online news writing. In particular, online news is much more fluid and prone to changes in content than traditional hard-copy newspapers: online newspaper articles often change over the course of the following days or even weeks as they respond to criticism and to new information becoming available. This poses a problem for social scientists who analyse newspaper coverage of science, health and risk topics, because it is no longer clear who has read and written which version, and what impact each version potentially had on national debates on these topics. In this note I briefly flag up this problem through two recent examples of U.K. national science stories and discuss the potential implications for PUS media research.

14.
We present IntoNews, a system that matches online news articles with spoken news from television newscasts, represented by closed captions. We formalize the news matching problem as two independent tasks: closed-caption segmentation and news retrieval. The system segments closed captions using a windowing scheme: a sliding or tumbling window. Next, it uses each segment to build a query by extracting representative terms, and the query is used to retrieve previously indexed news articles from a search engine. To detect when a new article should be surfaced, the system compares the set of retrieved articles with the previously retrieved set. The intuition is that if the difference between these sets is large enough, the topic of the newscast currently on air has likely changed and a new article should be displayed to the user. To evaluate IntoNews, we built a test collection using data from a second-screen application and a major online news aggregator. The dataset, manually segmented and annotated by expert assessors, serves as our ground truth and is freely available for download through the Webscope program. Our evaluation is based on a set of novel time-relevance metrics that take into account three different aspects of the problem at hand: precision, timeliness and coverage. We compare our algorithms against the best method previously proposed in the literature for this problem. Experiments show the trade-offs among precision, timeliness and coverage of the airing news. Our best method is four times more accurate than the baseline.
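A sketch of the windowing and change-detection logic: a sliding window over captions yields a query of representative terms, and a low Jaccard overlap between the current and previous retrieved sets signals a topic change. The stubbed search index, window size, and threshold are illustrative assumptions.

```python
from collections import Counter

def top_terms(caption_window, k=5):
    """Build a query from a caption segment: the k most frequent long words."""
    words = [w.lower() for line in caption_window
             for w in line.split() if len(w) > 3]
    return {w for w, _ in Counter(words).most_common(k)}

def search_news(query):
    """Stub for the news search engine: returns ids of matching articles."""
    index = {1: {"election", "results", "votes"},
             2: {"storm", "coast", "weather", "warning"}}
    return {doc for doc, terms in index.items() if query & terms}

def surface_articles(captions, window=3, threshold=0.5):
    """Slide a window over captions; surface new articles when the retrieved
    set differs enough (low Jaccard overlap) from the previous one."""
    previous = set()
    for i in range(0, len(captions) - window + 1):
        results = search_news(top_terms(captions[i:i + window]))
        union = results | previous
        jaccard = len(results & previous) / len(union) if union else 1.0
        if results and jaccard < threshold:
            print(f"window {i}: topic change, surfacing articles {results}")
        previous = results

surface_articles([
    "election results are coming in tonight",
    "early votes favour the incumbent in the election",
    "officials count the remaining votes",
    "breaking a storm warning was issued for the coast",
    "the storm may reach the coast by morning",
    "residents near the coast prepare for the weather",
])
```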

15.
[Purpose/Significance] News reports and public opinion comments on the same event are both interdependent and divergent. Through topic identification and topic association analysis, this study explores the similarities and differences, in both topic content and temporal stages, between news reports and the public opinion comments they trigger. It aims to support research on the resonance and deviation between public opinion events, as expressed in comments, and social reality, as expressed in news reports, thereby shedding light on how public opinion spreads and providing a basis for scientific government decision-making. [Method/Process] Based on the Lasswell (5W) model, the LDA topic model and Python tools, we design the research approach and workflow, crawl news report and comment data from the Tencent News and Zhihu platforms, and process and mine the data. [Result/Conclusion] The study finds that the topics of public opinion events deviate to some extent from those of social reality, spawning additional latent topics; that the development of public opinion events and social reality is largely aligned; and that social media gives rise to more public-opinion-event topics than news media do, while the social reality topics reflected by the two are similar.

16.
There is an ongoing debate about what matters more in the modern online media newsroom: the content and newsworthiness of the news, or the audience clicks. Using a dataset of over one million articles from five countries (Belarus, Kazakhstan, Poland, Russia, and Ukraine) and a novel machine learning methodology, I demonstrate that the content of news articles has a significant impact on their lifespan. My findings show that articles with positive sentiment tend to be displayed longer, and that high fear-emotion scores can extend the lifespan of news articles in autocratic regimes, with an impact that is substantial in magnitude. This paper proposes four new methods for improving information management methodology: a flexible version of Latent Dirichlet Allocation (LDA), a technique for performing relative sentiment analysis, a method for determining the semantic similarity between a news article and a newspaper's dominant narrative, and a novel approach to unsupervised model validation based on inter-feature consistency.
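One way to operationalize "semantic similarity between a news article and a newspaper's dominant narrative" is to compare topic distributions, e.g., an article against the outlet's mean distribution. The made-up distributions and the choice of Jensen-Shannon distance below are assumptions, not the paper's method.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Made-up LDA topic distributions (rows: articles of one outlet).
outlet_articles = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.65, 0.25, 0.10],
])

# The outlet's "dominant narrative" as its mean topic distribution.
narrative = outlet_articles.mean(axis=0)
narrative /= narrative.sum()

def narrative_similarity(article_topics):
    """1 - Jensen-Shannon distance: higher means closer to the narrative."""
    return 1.0 - jensenshannon(article_topics, narrative)

on_message = np.array([0.68, 0.22, 0.10])
off_message = np.array([0.05, 0.15, 0.80])
print(round(narrative_similarity(on_message), 3),
      round(narrative_similarity(off_message), 3))
```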

17.
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree, and (3) the hierarchy obtained by the algorithm explicitly reveals the collection's structure. We confirm these features, and thus the algorithm's feasibility, through clustering experiments on two collections of Japanese documents containing 83,099 and 14,701 documents, respectively. We also introduce an application of this algorithm to a document browser, which is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method; since each node in the hierarchy corresponds to a topic in the collection, the hierarchy provides direct access to articles by topic. A user can learn general translation knowledge for each topic by browsing the Japanese articles and their English translations. We also discuss techniques for presenting a large tree-formed hierarchy on a computer screen.
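A sketch of the core idea: clustering directly over per-document term-frequency vectors with running centroids, so memory stays proportional to the number of clusters rather than to a document-document proximity matrix. This single-pass, flat variant is a simplification of the paper's hierarchical algorithm.

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v * c2.get(t, 0) for t, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: each document joins the cluster with the most
    similar centroid, or starts a new one. No proximity matrix is built."""
    clusters = []  # each: {"centroid": Counter, "members": [indices]}
    for i, doc in enumerate(docs):
        tf = Counter(doc.lower().split())
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(tf, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": tf, "members": [i]})
        else:
            best["centroid"] += tf   # running centroid as summed counts
            best["members"].append(i)
    return [c["members"] for c in clusters]

print(cluster(["stock market rises", "market rises again",
               "new polar expedition", "polar ice expedition"]))
```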

18.
19.
The wide spread of fake news and its negative impact on society have attracted considerable attention to fake news detection. Existing fake news detection methods pay particular attention to the credibility of the users sharing news on social media and of the news sources, based on their level of participation in fake news dissemination. However, these methods ignore the important role of a news item's topical perspective (such as its political viewpoint) in users' and sources' decisions to share or publish it. These decisions are associated with the viewpoints shared by the echo chamber that a user belongs to, i.e., the user's socio-cognitive (SC) biases, and with the news source's partisan bias. The credibility of users and news sources therefore varies across topics according to these biases, which current fake news detection studies completely ignore. In this paper, we propose a Multi-View Co-Attention Network (MVCAN) that jointly models the latent topic-specific credibility of users and news sources for fake news detection. The key idea is to represent news articles, users, and news sources such that the topical viewpoints of news articles, the SC biases that determine users' viewpoints when sharing news, and the partisan bias of news sources are encoded as vectors. A novel variant of the Multi-Head Co-Attention (MHCA) mechanism is then proposed to encode the joint interaction from different views, including news-source and news-user, to implicitly model the credibility of users and news sources based on their interactions in real and fake news spreading for a given news topic. We conduct extensive experiments on two public datasets. The results show that MVCAN significantly outperforms other state-of-the-art methods, beating the best baselines by 3% on average in terms of F1 and accuracy.
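A sketch of one co-attention "view" built from PyTorch's stock multi-head attention used as cross-attention between news token representations and a user's sharing history. The dimensions, pooling, and fusion are illustrative assumptions rather than the paper's exact MHCA variant.

```python
import torch
import torch.nn as nn

class CoAttentionView(nn.Module):
    """Cross-attends two sequences in both directions and pools the results,
    e.g. news tokens vs. a user's sharing history (one 'view' of the model)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):
        a_ctx, _ = self.a_to_b(query=a, key=b, value=b)   # a attends to b
        b_ctx, _ = self.b_to_a(query=b, key=a, value=a)   # b attends to a
        return torch.cat([a_ctx.mean(dim=1), b_ctx.mean(dim=1)], dim=-1)

news = torch.randn(2, 30, 64)       # batch of 2 articles, 30 tokens each
user_hist = torch.randn(2, 10, 64)  # 10 previously shared items per user

view = CoAttentionView()
features = view(news, user_hist)       # one fused view; MVCAN combines several
logits = nn.Linear(128, 2)(features)   # real vs. fake
print(logits.shape)                    # torch.Size([2, 2])
```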

20.
In this work, we release a multi-domain and multi-modality event dataset (MMED), containing 25,052 textual news articles collected from hundreds of news media sites (e.g., Yahoo News, BBC News) and 75,884 image posts shared on Flickr by thousands of social media users. The articles, contributed by professional journalists, and the images, shared by amateur users, are annotated according to 410 real-world events covering emergencies, natural disasters, sports, ceremonies, elections, protests, military interventions, economic crises, and more. The MMED dataset was collected following the principles of high relevance to application needs, a wide range of event types, non-ambiguity of the event labels, imbalanced event clusters, and difficulty of discriminating between event labels. The dataset can stimulate innovative research on related challenging problems, such as (weakly aligned) cross-modal retrieval and cross-domain event discovery, and can inspire visual relation mining and reasoning. For comparison, 15 baselines covering two scenarios have been quantitatively and qualitatively evaluated on the dataset.
