Similar Documents
20 similar documents found (search time: 312 ms)
1.
2.
Company movements and market changes often make news headlines, providing managers with important business intelligence (BI). While existing corporate analyses often rely on numerical financial figures, relatively little work has been done to extract BI factors from textual news articles. In this research, we developed BizPro, an intelligent system for extracting and categorizing BI factors from news articles. BizPro consists of novel text mining procedures together with BI factor modeling and categorization. Expert guidance and human knowledge (with high inter-rater reliability) informed system development and the profiling of BI factors. We conducted a case study using the system to profile the BI factors of four major IT companies, based on 6859 sentences extracted from 231 news articles published in major news sources. The results show that the techniques chosen for BizPro, Naïve Bayes (NB) and Logistic Regression (LR), significantly outperformed a benchmark technique, and that NB outperformed LR in precision, recall, F-measure, and area under the ROC curve. This research contributes a new system for profiling company BI factors from news articles, provides new empirical findings that enhance understanding of BI factor extraction and categorization, and addresses an important yet under-explored concern of BI analysis.
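The NB-versus-LR comparison for sentence-level BI-factor categorization maps onto standard text-classification tooling. Below is a minimal sketch (not the BizPro system) using scikit-learn; the toy sentences and binary labels are invented for illustration.

```python
# Minimal sketch of NB vs. LR for sentence-level BI-factor categorization,
# in the spirit of the comparison described above (not BizPro itself).
# The toy sentences and binary labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

sentences = [
    "Acme Corp announced a merger with its largest competitor.",
    "The weather was pleasant at the shareholder meeting.",
    "Quarterly revenue grew 12 percent on strong cloud demand.",
    "The CEO thanked the catering staff for lunch.",
] * 10                       # repeat so both classes have enough samples
labels = [1, 0, 1, 0] * 10   # 1 = sentence carries a BI factor

X = TfidfVectorizer().fit_transform(sentences)
train, test = slice(0, 30), slice(30, 40)

for name, clf in [("NB", MultinomialNB()), ("LR", LogisticRegression())]:
    clf.fit(X[train], labels[train])
    pred = clf.predict(X[test])
    score = clf.predict_proba(X[test])[:, 1]
    print(name,
          "P=%.2f" % precision_score(labels[test], pred),
          "R=%.2f" % recall_score(labels[test], pred),
          "F1=%.2f" % f1_score(labels[test], pred),
          "AUC=%.2f" % roc_auc_score(labels[test], score))
```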

3.
This work aims to extract possible causal relations between noun phrases. Some causal relations are signalled by lexical patterns such as causal verbs and their subcategorization. We use lexical patterns as a filter to find causality candidates and recast causality extraction as a binary classification problem. To solve it, we introduce probabilities that a word pair or concept pair could form a causal noun-phrase pair, together with the probability that a cue phrase marks a causality pattern. These probabilities are learned from a raw corpus in an unsupervised manner. With this probabilistic model, we improve both precision and recall. Our causality extraction achieves an F-score of 77.37%, an improvement of 21.14 percentage points over the baseline model. Long-distance causal relations are extracted with binary-tree-styled cue phrases. We propose an incremental cue-phrase learning method based on a cue-phrase confidence score measured after each causal-classifier learning step; this cue-phrase learning yields a further recall gain of 15.37 percentage points.
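The two-stage idea (pattern filter, then probabilistic binary decision) can be sketched compactly. The patterns, pair probabilities, and threshold below are invented stand-ins for quantities the paper learns from a raw corpus.

```python
# Sketch of pattern-filtered causality classification in the spirit of the
# approach above: lexical patterns propose candidates, and a probabilistic
# score over word pairs accepts or rejects them. The pattern list, the
# probabilities, and the threshold are invented for illustration.
import re

CAUSAL_PATTERNS = [r"(\w[\w ]*?) causes ([\w ]+)",
                   r"(\w[\w ]*?) leads to ([\w ]+)"]

# P(causal | noun pair), as if estimated from a raw corpus without supervision.
PAIR_PROB = {("smoking", "cancer"): 0.9, ("rain", "flooding"): 0.8,
             ("meeting", "report"): 0.1}

def head(phrase, last=False):
    """Crude head word: strip articles, take the first (or last) token."""
    toks = [w for w in phrase.split() if w not in ("a", "an", "the")]
    return toks[-1] if last else toks[0]

def causality_candidates(sentence):
    """Step 1: lexical patterns act as a cheap candidate filter."""
    for pat in CAUSAL_PATTERNS:
        m = re.search(pat, sentence.lower())
        if m:
            yield head(m.group(1), last=True), head(m.group(2))

def is_causal(pair, threshold=0.5):
    """Step 2: binary decision from the learned pair probability."""
    return PAIR_PROB.get(pair, 0.0) >= threshold

for s in ["Heavy rain leads to flooding in the valley.",
          "The meeting causes a report every week."]:
    for pair in causality_candidates(s):
        print(pair, "->", "causal" if is_causal(pair) else "not causal")
```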

4.
Health misinformation has become an unfortunate fixture of social media platforms, where lies can spread faster than truth. Despite considerable work devoted to suppressing fake news, health misinformation, including low-quality health news, has persisted and even increased in recent years. One promising approach to fighting bad information is to study the temporal and sentiment characteristics of health news stories and how they are discussed and disseminated on social media platforms such as Twitter. As part of the search for innovative ways to fight health misinformation, this study analyzes a dataset of more than 1600 objectively and independently reviewed health news stories published over a 10-year span, along with nearly 50,000 Twitter posts responding to them. Specifically, it examines the source credibility of health news circulated on Twitter and the temporal and sentiment features of the tweets containing or responding to the health news reports. The results show that health news stories rated low by experts are discussed more, persist longer, and provoke stronger sentiment than highly rated ones in the tweetosphere. However, the highly rated stories retained fresh interest, in the form of new tweets, for a longer period. An in-depth understanding of how health news is distributed and discussed is the first step toward mitigating the surge of health misinformation. The findings provide insights into the mechanisms of health information dissemination on social media and practical implications for fighting and mitigating health misinformation on digital media platforms.
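The kind of per-story temporal and sentiment comparison described above can be sketched with pandas. The toy dataframe and its column names (story_rating, created_at, sentiment) are assumptions, not the study's actual schema.

```python
# Sketch of the temporal/sentiment comparison described above, on an
# invented toy dataset. Column names and values are assumptions.
import pandas as pd

tweets = pd.DataFrame({
    "story_id":     [1, 1, 1, 2, 2],
    "story_rating": ["low", "low", "low", "high", "high"],  # expert rating
    "created_at":   pd.to_datetime(["2020-01-01", "2020-01-05", "2020-02-01",
                                    "2020-01-01", "2020-01-02"]),
    "sentiment":    [-0.8, -0.6, -0.9, 0.2, 0.1],           # signed polarity
})

per_story = tweets.groupby(["story_id", "story_rating"]).agg(
    n_tweets=("sentiment", "size"),
    lifespan_days=("created_at", lambda t: (t.max() - t.min()).days),
    sentiment_strength=("sentiment", lambda s: s.abs().mean()),
).reset_index()

# Are low-rated stories discussed more, persisting longer, with stronger
# sentiment, as the study reports?
print(per_story.groupby("story_rating")[["n_tweets", "lifespan_days",
                                         "sentiment_strength"]].mean())
```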

5.
6.
This study attempted to use semantic relations expressed in text, in particular cause-effect relations, to improve information retrieval effectiveness. It investigated whether the information obtained by matching the cause-effect relations expressed in documents with those expressed in users' queries can improve document retrieval results, compared with keyword matching alone. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights for combining the scores from different types of matching were customized for each query. Causal relation matching did not perform better than word proximity matching (i.e. matching pairs of causally related words in the query with pairs of words that co-occur within document sentences), but the best results were obtained when causal relation matching was combined with word proximity matching. The most effective kind of causal relation matching was one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match any word.
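The wildcard matching and per-query score combination can be made concrete in a few lines. The relations, documents, and weight values below are invented for illustration; the paper tunes such weights per query.

```python
# Sketch of causal-relation matching with a wildcard member, combined with
# a keyword score, as described above. Relations are (cause, effect) pairs;
# the documents, extracted relations, and weights are invented.
WILDCARD = "*"

def relation_matches(query_rel, doc_rel):
    """A query relation matches if each member is equal or a wildcard."""
    return all(q == WILDCARD or q == d for q, d in zip(query_rel, doc_rel))

def score(query_rels, doc_rels, keyword_score, w_rel=0.4, w_kw=0.6):
    """Combine the relation-match count with a keyword-matching score.

    The paper found the best results when such weights were customized
    per query; the fixed values here are purely illustrative.
    """
    hits = sum(relation_matches(q, d) for q in query_rels for d in doc_rels)
    return w_rel * hits + w_kw * keyword_score

doc_rels = [("deregulation", "growth"), ("inflation", "unrest")]
# Query asks "what causes unrest?" -> the cause member is a wildcard.
print(score([(WILDCARD, "unrest")], doc_rels, keyword_score=1.5))
```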

7.
Generating news headlines has long been a prominent problem in Natural Language Processing research. Modern transformer models, when fine-tuned, can produce a good headline that attends to all parts of a disaster-news article. However, a disaster-news headline should focus on the event, its effects, and the major impacts, which a transformer model misses when generating the headline on its own. The extract-then-abstract method proposed in this article improves the performance of a state-of-the-art transformer abstractor in generating good-quality disaster-news headlines. In this work, a Deep Neural Network (DNN) based sentence extractor and a transformer-based abstractive summarizer work sequentially to generate a headline. Sentence extraction is formulated as a binary classification problem in which the DNN is trained to produce two binary labels: one reflecting the sentence's similarity with ground-truth headlines, and the other the presence of information about the disaster and its impact. The transformer model then generates the headline from the sentences extracted by the DNN. ROUGE scores of headlines generated with the proposed method are better than those of headlines generated directly from the original documents. The largest ROUGE 1, 2, and 3 score improvements are observed for the Text-To-Text Transfer Transformer (T5) model: 17.85%, 38.13%, and 21.01%, respectively. These improvements suggest that the proposed method has high utility for producing effective headlines from disaster-related news articles.
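The extract-then-abstract pipeline can be sketched with off-the-shelf components. A trivial keyword scorer stands in below for the trained DNN extractor, and a stock T5 checkpoint for the fine-tuned abstractor; it assumes the Hugging Face `transformers` library and the "t5-small" weights are available.

```python
# Sketch of the extract-then-abstract pipeline described above. A trivial
# cue-word scorer stands in for the trained DNN sentence extractor, and an
# off-the-shelf T5 checkpoint for the fine-tuned abstractor.
from transformers import T5ForConditionalGeneration, T5Tokenizer

DISASTER_CUES = {"flood", "earthquake", "killed", "damage", "evacuated"}

def extract(sentences, k=2):
    """Placeholder for the DNN extractor: keep the k most cue-laden sentences."""
    return sorted(sentences,
                  key=lambda s: -len(DISASTER_CUES & set(s.lower().split())))[:k]

article = [
    "A powerful earthquake struck the coastal region on Monday.",
    "Local markets reopened after the holiday weekend.",
    "At least 40 people were killed and thousands were evacuated.",
]

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
inputs = tok("summarize: " + " ".join(extract(article)), return_tensors="pt")
ids = model.generate(inputs.input_ids, max_length=16)
print(tok.decode(ids[0], skip_special_tokens=True))
```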

8.
Since the 1960s, the adverb-noun construction has gone through a process of moving from being treated as an illegitimate linguistic phenomenon to gradually being studied and attended to by scholars, and finally gaining acceptance as legitimate. Scholars have long disagreed about its essential nature, with no consensus reached. Building on previous scholarship and approaching the question from the perspective of the adverb-noun relationship, the author proposes that the essence of the adverb-noun combination is an adverb modifying a quasi-adjective, that is, a noun whose characteristic meaning is foregrounded; the construction as a whole is static with a dynamic element.

9.
Legal researchers, recruitment professionals, healthcare information professionals, and patent analysts all undertake work tasks in which search forms a core part of their duties. In these instances, the search task is often complex and time-consuming and requires specialist expertise to identify relevant documents and insights within large domain-specific repositories and collections. Several studies have investigated the search practices of such professionals, but few have directly compared their professional practices, so it remains unclear to what extent insights and approaches from one domain can be applied to another. In this paper we describe the results of a survey of a purposive sample of 108 legal researchers, 64 recruitment professionals and 107 healthcare information professionals. Their responses are compared with results from a previous survey of 81 patent analysts. The survey investigated their search practices and preferences, the types of functionality they value, and their requirements for future information retrieval systems. The results reveal that these professions share many fundamental needs and face similar challenges: in particular, a continuing preference to formulate queries as Boolean expressions; the need to manage, organise and re-use search strategies and results; and an ambivalence toward relevance ranking. The results stress the importance of recall and coverage for the healthcare and patent professionals, while precision and recency were more important to the legal and recruitment professionals. The results also highlight the need to ensure that search systems give confidence to the professional searcher, so trust, explainability and accountability remain significant challenges when developing such systems. The findings suggest that translational research between the different areas could benefit professionals across domains.

10.
This paper focuses on extracting temporal and parent-child relationships between news events in social news. Previous methods have shown that syntactic features are useful. However, most directly use the static parses produced by syntactic parsing tools, and task-irrelevant or erroneous parses inevitably degrade model performance. In addition, many implicit higher-order connections that are directly related and critical to the task are not explicitly exploited. In this paper, we propose a novel syntax-based dynamic latent graph model (SDLG) for this task. Specifically, we first apply a syntactic type-enhanced attention mechanism that assigns different weights to different connections in the parsing results, which helps filter out noisy connections and better fuse the information in the syntactic structures. Next, we introduce a dynamic event-pair-aware induction graph to mine task-related latent connections. It constructs a potential attention matrix to complement and correct the supervised syntactic features, using the semantics of the event pairs as a guide. Finally, the latent graph, together with the syntactic information, is fed into a graph convolutional network to obtain an improved event representation for relational reasoning. We conducted extensive experiments on four public benchmarks: MATRES, TCR, HiEve and TB-Dense. The results show that our model outperforms the state-of-the-art model by 0.4%, 1.5%, 3.0% and 1.3% in F1 score on the four datasets, respectively. Finally, we provide detailed analyses showing the effectiveness of each proposed component.
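The core of the syntactic type-enhanced attention idea, weighting dependency edges by their relation type so noisy parse connections can be down-weighted before graph convolution, can be sketched in numpy. The type scores, toy parse, and dimensions below are invented simplifications; this is not the full SDLG model.

```python
# Simplified numpy sketch of syntactic type-enhanced attention: each
# dependency edge gets a weight from its syntactic relation type, so noisy
# parse connections are down-weighted before one GCN-style convolution.
# Type scores, the toy parse, and dimensions are invented; not SDLG itself.
import numpy as np

tokens = ["storm", "caused", "flooding"]
edges = [(0, 1, "nsubj"), (2, 1, "obj")]               # (child, head, dep type)
type_score = {"nsubj": 2.0, "obj": 2.0, "punct": 0.1}  # learned in the paper

n, d = len(tokens), 4
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))                            # token representations

A = np.eye(n)                                          # adjacency + self-loops
for c, h, t in edges:
    A[c, h] = A[h, c] = type_score.get(t, 0.1)

attn = A / A.sum(axis=1, keepdims=True)                # normalize over neighbours
W = rng.normal(size=(d, d))
H_out = np.tanh(attn @ H @ W)                          # one GCN-style layer
print(H_out.shape)
```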

11.
In this paper, we introduce a novel knowledge-based word-sense disambiguation (WSD) system. The main goal of our research is to find an effective way to filter out unnecessary information by using word similarity. To this end, we adopt two methods in our WSD system. First, we propose a novel encoding method for word vector representation that considers the graphical semantic relationships in lexical knowledge bases; this representation is used to determine word similarity in our WSD system. Second, we present an effective method for extracting, based on word similarity, the contextual words relevant to analyzing an ambiguous word. The results demonstrate that the suggested methods significantly enhance the baseline WSD performance on all corpora. In particular, the performance on nouns is comparable to that of state-of-the-art knowledge-based WSD models, and the performance on verbs surpasses that of existing knowledge-based WSD models.
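The similarity-based filtering-and-scoring idea can be sketched as follows: keep only context words similar to some gloss word, then pick the sense whose gloss is closest to the filtered context. The toy embeddings, glosses, and threshold are invented; the paper derives its vectors from lexical knowledge bases.

```python
# Sketch of similarity-based knowledge-based WSD in the spirit of the
# approach above. Toy embeddings, glosses, and threshold are invented.
import numpy as np

emb = {  # toy word vectors standing in for knowledge-base-derived ones
    "bank": np.array([0.5, 0.5]), "river": np.array([0.9, 0.1]),
    "water": np.array([0.8, 0.2]), "money": np.array([0.1, 0.9]),
    "loan": np.array([0.2, 0.8]), "shore": np.array([0.9, 0.2]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def disambiguate(context, senses):
    """Filter context to words similar to some gloss word, then score senses."""
    best, best_score = None, -1.0
    for sense, gloss in senses.items():
        g = np.mean([emb[w] for w in gloss], axis=0)
        ctx = [emb[w] for w in context if w in emb and cos(emb[w], g) > 0.3]
        score = cos(np.mean(ctx, axis=0), g) if ctx else -1.0
        if score > best_score:
            best, best_score = sense, score
    return best

senses = {"bank#river": ["river", "shore"], "bank#finance": ["money", "loan"]}
print(disambiguate(["water", "river"], senses))   # -> bank#river
```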

12.
In this paper, we propose a novel approach to multilingual story link detection. Our approach uses the distributional features of terms across timelines and multilingual spaces, together with selected types of named entities, to obtain distinctive weights for the terms that constitute the linguistic representation of events. On timelines, term significance is calculated by comparing the term distribution of the documents on a given day with that of the total document collection. Since two languages can provide more information than one, term significance is measured in each language space and then used as a bridge between the two languages in multilingual spaces. Evaluated on Korean and Japanese news articles, our method achieved a 14.3% improvement for monolingual story pairs and a 16.7% improvement for multilingual story pairs. By measuring space density, the proposed weighting components are verified by a high density of intra-event stories and a low density of inter-event stories. These results indicate that the proposed method is helpful for multilingual story link detection.
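The timeline component, contrasting a day's term distribution with the whole collection, can be sketched directly. The ratio-style weight below is a plausible stand-in, not necessarily the paper's exact formula, and the toy documents are invented.

```python
# Sketch of timeline-based term weighting in the spirit of the method above:
# a term is significant on a day if its relative frequency that day exceeds
# its relative frequency in the whole collection. The exact weighting
# formula here is a plausible stand-in, not the paper's.
from collections import Counter

day_docs = ["earthquake hits city", "earthquake rescue begins"]
all_docs = day_docs + ["election results announced", "market closes higher",
                       "city council meets"]

def rel_freq(docs):
    c = Counter(w for d in docs for w in d.split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

day, coll = rel_freq(day_docs), rel_freq(all_docs)
significance = {w: day[w] / coll[w] for w in day}   # > 1 => bursty that day
for w, s in sorted(significance.items(), key=lambda x: -x[1])[:3]:
    print(w, round(s, 2))
```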

13.
Measuring the similarity between the semantic relations that hold between words is an important step in numerous natural language processing tasks such as answering word analogy questions, classifying compound nouns, and word sense disambiguation. Given two word pairs (A, B) and (C, D), we propose a method to measure the relational similarity between the semantic relations that hold between the two words in each pair. Typically, a high degree of relational similarity is observed in proportional analogies (i.e. analogies among the four words: A is to B as C is to D). We describe eight types of relational symmetries that are frequently observed in proportional analogies and use those symmetries as features in a supervised learning approach to robustly and accurately estimate the relational similarity between two given word pairs. We use automatically extracted lexical-syntactic patterns to represent the semantic relations between two words, and then match those patterns in Web search engine snippets to find candidate words that form proportional analogies with the original word pair. We evaluate the proposed method using the Scholastic Aptitude Test (SAT) word analogy benchmark dataset. Our experimental results show that the proposed method can accurately measure relational similarity between word pairs by exploiting the symmetries in proportional analogies, achieving an SAT score of 49.2%, which is comparable to the best results reported on this dataset.
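The pattern-based representation of a word pair can be sketched as a vector of lexical-pattern counts compared by cosine. The patterns and counts below are invented stand-ins for web-snippet mining, and the paper's eight symmetry features are omitted for brevity.

```python
# Sketch of pattern-based relational similarity: each word pair is
# represented by counts of lexical patterns connecting its words in text
# (here, invented counts standing in for search-snippet mining), and two
# pairs are compared by cosine similarity.
import math

pair_patterns = {  # pattern -> count, as if mined from search snippets
    ("ostrich", "bird"): {"X is a large Y": 12, "Y such as X": 8},
    ("lion", "cat"):     {"X is a large Y": 9,  "Y such as X": 5},
    ("flour", "bread"):  {"X is made into Y": 7, "Y made from X": 4},
}

def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relational_similarity(p1, p2):
    return cosine(pair_patterns[p1], pair_patterns[p2])

print(relational_similarity(("ostrich", "bird"), ("lion", "cat")))    # high
print(relational_similarity(("ostrich", "bird"), ("flour", "bread"))) # 0.0
```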

14.
[Purpose/Significance] This paper experimentally analyses how different feature extraction algorithms affect the clustering of news texts. [Method/Process] Using the Sohu news corpus from Sogou Labs and the Australian Broadcasting Corporation news-headline corpus (2003-2017), three single features (TF-IDF, Word2vec and Doc2vec) and four feature combinations (TF-IDF+Word2vec, TF-IDF+Doc2vec, Word2vec+Doc2vec and TF-IDF+Word2vec+Doc2vec) were clustered with the K-means, agglomerative and DBSCAN algorithms, and the clustering results were evaluated with the Purity and NMI metrics. [Result/Conclusion] Among the single features, clustering quality follows the order Word2vec > TF-IDF > Doc2vec; among the combinations, TF-IDF+Word2vec performs best. Word2vec is the strongest single feature and is also the main factor behind the differences among feature combinations; whether combining features improves clustering performance must be judged on the basis of multiple factors.
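The feature-plus-clustering-plus-evaluation pipeline can be sketched with scikit-learn. Only TF-IDF features are shown; Word2vec or Doc2vec vectors (e.g. from gensim) would slot in as the feature matrix. Purity is implemented by hand since scikit-learn does not provide it; the toy headlines and labels are invented.

```python
# Sketch of the comparison pipeline described above: build features, run
# K-means, and score with NMI and Purity. Toy data invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

docs = ["stocks rally on earnings", "market closes at record high",
        "team wins championship final", "star player scores twice"]
true = [0, 0, 1, 1]   # finance vs. sport

X = TfidfVectorizer().fit_transform(docs)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

def purity(true, pred):
    """Fraction of docs falling in the majority true class of their cluster."""
    true, pred = np.asarray(true), np.asarray(pred)
    return sum(np.bincount(true[pred == c]).max()
               for c in np.unique(pred)) / len(true)

print("Purity:", purity(true, pred))
print("NMI:", normalized_mutual_info_score(true, pred))
```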

15.
Political polarization remains perhaps the "greatest barrier" to effective COVID-19 pandemic mitigation measures in the United States, and social media has been implicated in fueling this polarization. In this paper, we uncover the network of COVID-19 related news sources shared to 30 politically biased and 2 neutral subcommunities on Reddit. Using exponential random graph modeling, we find that news sources associated with highly toxic ("rude, disrespectful") content are more likely to be shared across political subreddits. We also find homophily by toxicity level in the network of online news sources. Our findings suggest that news sources associated with high toxicity are rewarded with prominent positions in the resulting network. The toxicity in COVID-19 discussions may fuel political polarization by denigrating ideological opponents and politicizing responses to the COVID-19 pandemic, all to the detriment of mitigation measures. Public health practitioners should monitor toxicity in public online discussions to familiarize themselves with emerging political arguments that threaten adherence to public health crisis management. Based on our findings, we also recommend that social media platforms algorithmically promote neutral and scientific news sources to reduce toxic discussion in subcommunities and encourage compliance with public health recommendations in the fight against COVID-19.
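The paper fits an exponential random graph model, which is typically done with R's `ergm` package. As a lightweight Python proxy for the homophily finding only (not an ERGM), the sketch below checks assortativity by toxicity level with networkx on an invented toy network of news sources.

```python
# Simplified homophily check, not an ERGM: a positive assortativity
# coefficient means sources tend to connect to similarly toxic sources.
# The toy network and toxicity labels are invented for illustration.
import networkx as nx

G = nx.Graph()
toxlevel = {"sourceA": "high", "sourceB": "high",
            "sourceC": "low", "sourceD": "low"}
G.add_nodes_from((n, {"tox": t}) for n, t in toxlevel.items())
# Edges: co-sharing across subreddits; toxic sources link to toxic ones.
G.add_edges_from([("sourceA", "sourceB"), ("sourceC", "sourceD"),
                  ("sourceA", "sourceC")])

print(nx.attribute_assortativity_coefficient(G, "tox"))  # positive => homophily
```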

16.
False news spreading on social media has proliferated over the past years and has led to multi-aspect threats in the real world. While there are studies of false news in specific domains (such as politics or health care), little work has compared false news across domains. In this article, we investigate false news across nine domains on Weibo, the largest Twitter-like social media platform in China, from 2009 to 2019. The newly collected data comprise 44,728 posts in the nine domains, published by 40,215 users and reposted over 3.4 million times. Based on the distributions and spreads of the multi-domain dataset, we observe that false news in domains close to daily life, such as health and medicine, generated more posts but diffused less effectively than false news in other domains such as politics, and that political false news had the most effective capacity for diffusion. The widely diffused false news posts on Weibo were strongly associated with certain types of users (by gender, age, etc.). Further, these posts provoked strong emotions in the reposts and diffused further with the active engagement of false-news starters. Our findings can help in designing false news detection systems for suspicious news discovery, veracity prediction, and display and explanation. Comparing our findings on Weibo with those of existing work reveals nuanced patterns, suggesting the need for more research on data from diverse platforms, countries, and languages to tackle the global issue of false news. The code and the new anonymized dataset are available at https://github.com/ICTMCG/Characterizing-Weibo-Multi-Domain-False-News.

17.
The performance of information retrieval systems is limited by the linguistic variation present in natural language texts. Word-level natural language processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on extending these techniques to deal with phrase-level variation in European languages, taking Spanish as a case in point. We propose the use of syntactic dependencies as complex index terms, in an attempt to solve the problems deriving from both syntactic and morpho-syntactic variation and, in this way, to obtain more precise index terms. Such dependencies are obtained through a shallow parser based on cascades of finite-state transducers, in order to reduce as far as possible the overhead of the parsing process. We also studied the use of different sources of syntactic information, queries or documents, as well as restricting the dependencies to those obtained from noun phrases. Our approaches were tested on the CLEF corpus, obtaining consistent improvements over classical word-level non-linguistic techniques. The results show, on the one hand, that syntactic information extracted from documents is more useful than that extracted from queries. On the other hand, restricting dependencies to those corresponding to noun phrases achieves important reductions in storage and management costs, albeit at the expense of a slight reduction in performance.
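Extracting head-modifier dependency pairs restricted to noun phrases can be sketched with spaCy, swapped in here for the paper's cascades of finite-state transducers, and English swapped in for Spanish. It assumes the en_core_web_sm model is installed.

```python
# Sketch of using syntactic dependencies as complex index terms, restricted
# to noun phrases. spaCy stands in for the paper's finite-state transducer
# cascades; assumes `python -m spacy download en_core_web_sm` has been run.
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_index_terms(text):
    """Yield (head lemma, modifier lemma) pairs found inside noun phrases."""
    doc = nlp(text)
    for chunk in doc.noun_chunks:
        for tok in chunk:
            if tok.dep_ in ("amod", "compound") and chunk.start <= tok.head.i < chunk.end:
                yield (tok.head.lemma_, tok.lemma_)

terms = list(dependency_index_terms("The rapid growth of renewable energy"))
print(terms)   # e.g. [('growth', 'rapid'), ('energy', 'renewable')]
```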

18.
A news article's online audience provides useful insights about the article's identity. However, fake news classifiers using such information risk relying on profiling. In response to the rising demand for ethical AI, we present a profiling-avoiding algorithm that leverages Twitter users during model optimisation while excluding them when an article's veracity is evaluated. For this, we take inspiration from the social sciences and introduce two objective functions that maximise correlation between an article and its spreaders, and among those spreaders. We applied our profiling-avoiding algorithm to three popular neural classifiers and obtained results on fake news data covering a variety of news topics. The positive impact on prediction performance demonstrates the soundness of the proposed objective functions for integrating social context into text-based classifiers. Moreover, statistical visualisation and dimension-reduction techniques show that the user-inspired classifiers better discriminate between unseen fake and true news in their latent spaces. Our study serves as a stepping stone toward resolving the under-explored issue of profiling-dependent decision-making in user-informed fake news detection.
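One plausible reading of the two correlation objectives can be sketched in numpy: reward correlation between an article representation and its spreaders' mean, and among the spreaders themselves, during training only. The vector shapes and the exact loss form are assumptions, not the paper's definitions.

```python
# Sketch of correlation-style training objectives in the spirit described
# above: users inform optimisation but are excluded at evaluation time.
# Shapes, embeddings, and the exact loss form are assumptions.
import numpy as np

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

rng = np.random.default_rng(0)
article_vec = rng.normal(size=16)                        # text-encoder output
user_vecs = article_vec + 0.1 * rng.normal(size=(5, 16)) # its spreaders

# Objective 1: the article should correlate with its spreaders' mean.
loss_article_users = 1.0 - pearson(article_vec, user_vecs.mean(axis=0))
# Objective 2: spreaders of the same article should correlate pairwise.
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
loss_among_users = 1.0 - np.mean([pearson(user_vecs[i], user_vecs[j])
                                  for i, j in pairs])
print(loss_article_users, loss_among_users)
```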

19.
A method is introduced to recognize the part-of-speech of words in English texts using knowledge of linguistic regularities rather than voluminous dictionaries. The algorithm proceeds in two steps. In the first step, information concerning the part-of-speech is extracted from each word of the text in isolation, using morphological analysis as well as the fact that English has a reasonable number of word endings characteristic of part-of-speech. The second step looks at a whole sentence and, using syntactic criteria, assigns the part-of-speech to a single word according to the parts-of-speech and other features of the surrounding words. In particular, the parts-of-speech relevant for automatic indexing of documents, i.e. nouns, adjectives, and verbs, are recognized. Applying this method to a large corpus of scientific text showed that the part-of-speech was identified correctly for 84% of the words and definitely wrongly for only 2%; the remaining words received ambiguous assignments. Using only word lists of limited extent, the technique may thus be a valuable tool for automatic indexing of documents and automatic thesaurus construction, as well as other kinds of natural language processing.
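The two-step procedure can be sketched directly: suffix rules guess parts-of-speech per word in isolation, and a contextual rule disambiguates. The suffix table and the single context rule below are invented simplifications of the linguistic regularities the paper relies on.

```python
# Sketch of the two-step idea above: (1) guess parts-of-speech per word in
# isolation from characteristic English endings, (2) disambiguate using
# surrounding words. The suffix table and context rule are invented
# simplifications of the paper's linguistic regularities.
SUFFIX_POS = [("tion", "noun"), ("ness", "noun"), ("ment", "noun"),
              ("ous", "adjective"), ("ful", "adjective"),
              ("ize", "verb"), ("ify", "verb")]

def guess_isolated(word):
    """Step 1: morphological guess; unknown words keep several candidates."""
    for suffix, pos in SUFFIX_POS:
        if word.endswith(suffix):
            return {pos}
    return {"noun", "verb", "adjective"}

def tag(sentence):
    """Step 2: a toy syntactic rule -- after a determiner, prefer noun."""
    words = sentence.lower().split()
    tags = [guess_isolated(w) for w in words]
    for i in range(1, len(words)):
        if words[i - 1] in ("the", "a", "an") and "noun" in tags[i]:
            tags[i] = {"noun"}
    return list(zip(words, tags))

print(tag("the classification of a document is useful"))
```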

20.
This study aims at helping people recognize health misinformation on social media in China. A coding scheme was first developed to identify the features of health misinformation on social media, based on content analysis of 482 pieces of health information from WeChat, a social media platform widely used in China. The scheme identifies salient features of health misinformation, including exaggeration and absolutes, induced text, claims of being unique and secret, intemperate tone or language, statements of excessive significance, and the like. The scheme was then evaluated in a user-centred experiment to test whether it is useful for identifying features of health misinformation. Forty-four participants in the experimental group and 38 in the control group completed the experiment, which compared how effectively the participants used the scheme to identify health misinformation. The results indicate that the scheme is effective in improving users' capability to identify health misinformation, and that the capability of participants in the experimental group improved significantly compared with that of the control group. The study provides insights into health misinformation and has implications for enhancing people's online health information literacy. It informs the development of a system that could automatically limit the spread of health misinformation and can potentially improve users' online health information literacy, in particular under the circumstances of the COVID-19 pandemic.
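The scheme's salient features could be turned into automatic flags as a first step toward the automated system the study envisions. The cue lists and rules below are invented illustrations, not the scheme itself.

```python
# Sketch of turning the scheme's salient features into automatic flags.
# The cue lists and regex rules are invented illustrations.
import re

FEATURE_CUES = {
    "exaggeration/absolutes": r"\b(always|never|100%|cures everything)\b",
    "unique and secret":      r"\b(secret|doctors won't tell you)\b",
    "induced text":           r"\b(share (this|now)|forward to)\b",
    "intemperate tone":       r"(!!+|\bshocking\b|\bmiracle\b)",
}

def flag_features(text):
    """Return the scheme features whose cue patterns fire on the text."""
    t = text.lower()
    return [name for name, pat in FEATURE_CUES.items() if re.search(pat, t)]

post = "SHOCKING miracle remedy doctors won't tell you -- share this now!!"
print(flag_features(post))
```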
