Similar Documents
20 similar documents found (search time: 11 ms)
1.
Detecting collusive spammers who collaboratively post fake reviews is extremely important for guaranteeing the reliability of review information on e-commerce platforms. In this research, we formulate collusive spammer detection as an anomaly detection problem and propose a novel detection approach based on a heterogeneous graph attention network. First, we analyze the review dataset from different perspectives and use statistical distributions to model each user's review behavior. By introducing the Bhattacharyya distance, we calculate user-user and product-product correlation degrees to construct a multi-relation heterogeneous graph. Second, we combine a biased random walk strategy with a multi-head self-attention mechanism in a heterogeneous graph attention network model to learn node embeddings from the multi-relation heterogeneous graph. Finally, we propose an improved community detection algorithm to acquire candidate spamming groups and employ an autoencoder-based anomaly detection model to identify collusive spammers. Experiments show that the average improvements in precision@k and recall@k of the proposed approach over the best baseline method on the Amazon, Yelp_Miami, Yelp_New York, Yelp_San Francisco, and YelpChi datasets are [13%, 3%], [32%, 12%], [37%, 7%], [42%, 10%], and [18%, 1%], respectively.
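The Bhattacharyya distance used above measures the overlap between two statistical distributions; a minimal sketch for discrete distributions follows (the function name and the smoothing constant are illustrative choices, not the paper's implementation):

```python
import math

def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two discrete distributions,
    e.g., histograms modeling two users' review behavior."""
    sp, sq = float(sum(p)), float(sum(q))
    # Bhattacharyya coefficient: overlap of the normalized distributions
    bc = sum(math.sqrt((pi / sp) * (qi / sq)) for pi, qi in zip(p, q))
    return -math.log(max(bc, eps))  # eps guards against log(0)
```

A small distance means the two behavior distributions largely overlap, which a detector of this kind can turn into a user-user correlation degree when building the graph.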

2.
In recent years, spam has pervaded all forms of digital communication. The growing user base of social platforms like Facebook, Twitter, and YouTube has opened new avenues for spammers. The liberty to contribute content freely has encouraged spammers to exploit social platforms for their own benefit. E-mail and web search engines, being the early victims of spam, have attracted serious attention from information scientists for quite some time, and a substantial amount of research has been directed at combating spam on these two platforms. Social networks, being quite different in nature from the earlier two, host different kinds of spam, and spam-fighting techniques from those domains seldom work. Moreover, due to the continuous and rapid evolution of social media, spam itself evolves very fast, posing a great challenge to the community. Despite the area being relatively new, there have been a number of attempts at tackling social spam in the recent past, and many more are certain to come in the near future. This paper surveys recent developments in social spam detection and mitigation, its theoretical models and applications, along with their qualitative comparison. We present the state of the art and attempt to identify challenges to be addressed, as the nature and content of spam are bound to get more complicated.

3.
Micro-blogging services such as Twitter allow anyone to publish anything, anytime. Needless to say, much of the available content can be dismissed as babble or spam. However, given the number and diversity of users, some valuable pieces of information should arise from the stream of tweets. Thus, such services can develop into valuable sources of up-to-date information (the so-called real-time web), provided a way to find the most relevant, trustworthy, and authoritative users is available. Finding such users is therefore a highly pertinent question to which graph centrality methods can provide an answer. In this paper the author offers a comprehensive survey of feasible algorithms for ranking users in social networks, examines their vulnerabilities to linking malpractice in such networks, and suggests an objective criterion against which to compare such algorithms. Additionally, he suggests a first step towards “desensitizing” prestige algorithms against cheating by spammers and other abusive users.
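Among the prestige algorithms such a survey would typically cover, PageRank is the canonical example. A minimal power-iteration sketch over a follower graph is below; the dict-based graph representation and fixed iteration count are illustrative choices, not any particular paper's implementation:

```python
def pagerank(adj, d=0.85, iters=50):
    """adj: dict mapping each node to the nodes it links to
    (e.g., the users a given user follows or retweets)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # teleportation term shared by every node
        new = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # dangling node: distribute its rank evenly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```

The vulnerability the abstract alludes to is visible here: link farms can inflate `rank` by densely linking to a target, which is why desensitized variants are of interest.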

4.
Dynamic Ensemble Selection (DES) is one of the most common and effective techniques in machine learning for dealing with classification problems. DES systems aim to construct an ensemble consisting of the most appropriate classifiers, selected from a candidate classifier pool according to the competence level of each individual classifier. Since several classifiers are selected, their combination becomes crucial. However, most current DES approaches focus on the combination of the selected classifiers while ignoring the local information surrounding the query sample to be classified. To boost the performance of DES-based classification systems, we propose in this paper a dynamic weighting framework for classifier fusion when obtaining the final output of a DES system. In particular, the proposed method first employs a DES approach to obtain a group of classifiers for a query sample. Then, the hypothesis vector of the selected ensemble is obtained based on an analysis of consensus. Finally, a distance-based weighting scheme is developed to adjust the hypothesis vector depending on the closeness of the query sample to each class. The proposed method is tested on 30 real-world datasets with six well-known DES approaches based on both homogeneous and heterogeneous ensembles. The obtained results, supported by proper statistical tests, show that our method outperforms the original DES framework in terms of both accuracy and kappa measures.
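The idea of the final step, amplifying the entries of the ensemble's hypothesis (class-probability) vector for classes whose centroids lie close to the query, can be sketched as follows. The inverse-distance weighting and the function names are assumptions for illustration, not the paper's exact scheme:

```python
import math

def distance_weighted_hypothesis(hyp, query, centroids, eps=1e-9):
    """Adjust a class-probability vector by the query's closeness
    to each class centroid, then renormalize.

    hyp: ensemble hypothesis vector, one entry per class.
    centroids: one feature-space centroid per class."""
    weights = [1.0 / (math.dist(query, c) + eps) for c in centroids]
    adjusted = [h * w for h, w in zip(hyp, weights)]
    total = sum(adjusted)
    return [a / total for a in adjusted]
```

With this choice, a tie in the ensemble vote is broken in favor of the class the query actually sits next to in feature space.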

5.
Both node classification and link prediction are popular topics in supervised learning on graph data, but previous works seldom integrate them to capture their complementary information. In this paper, we propose a Multi-Task and Multi-Graph Convolutional Network (MTGCN) to jointly conduct node classification and link prediction in a unified framework. Specifically, MTGCN consists of multiple multi-task learning modules, each of which learns the complementary information between node classification and link prediction. In particular, each module uses different inputs to output representations of the graph data. Moreover, the parameters of one module initialize the parameters of the next, so that the useful information in the former can be propagated to the latter. As a result, the information is augmented to guarantee the quality of the representations by exploring the complex structure inherent in the graph data. Experimental results on six datasets show that MTGCN outperforms the comparison methods in terms of both node classification and link prediction.

6.
Social networks have grown into a widespread form of communication that allows a large number of users to participate in conversations and consume information at any time. The casual nature of social media allows for nonstandard terminology, some of which may be considered rude and derogatory. As a result, a significant portion of social media users is found to use disrespectful language. This problem may intensify in certain developing countries where young children are granted unsupervised access to social media platforms. Furthermore, the sheer amount of social media data generated daily by millions of users makes it impractical for humans to monitor and regulate inappropriate content. If adolescents are exposed to these harmful language patterns without adequate supervision, they may feel obliged to adopt them. In addition, unrestricted aggression in online forums may result in cyberbullying and other dreadful occurrences. While computational linguistics research has addressed the difficulty of detecting abusive dialogues, questions remain unanswered for low-resource languages with little annotated data, leading the majority of supervised techniques to perform poorly. In addition, social media content is often presented in complex, context-rich formats that encourage creative user involvement. Therefore, we propose to improve the performance of abusive language detection and classification in a low-resource setting, using both abundant unlabeled data and context features via the co-training protocol, which enables two machine learning models, each learning from an orthogonal set of features, to teach each other, resulting in an overall performance improvement. Empirical results reveal that our proposed framework achieves F1 values of 0.922 and 0.827, surpassing the state-of-the-art baselines by 3.32% and 45.85% for binary and fine-grained classification tasks, respectively. In addition to proving the efficacy of co-training in a low-resource situation for abusive language detection and classification tasks, the findings shed light on several opportunities to use unlabeled data and the contextual characteristics of social networks in a variety of social computing applications.

7.
Online recommender systems have been shown to be vulnerable to group shilling attacks, in which the attackers of a shilling group collaboratively inject fake profiles with the aim of increasing or decreasing the frequency with which particular items are recommended. Existing detection methods mainly use frequent itemset (dense subgraph) mining or clustering methods to generate candidate groups and then utilize hand-crafted features to identify shilling groups. However, such two-stage detection methods have two limitations. On the one hand, due to the sensitivity of the support threshold or clustering parameter settings, it is difficult to guarantee the quality of the candidate groups generated. On the other hand, they all rely on manual feature engineering to extract detection features, which is costly and time-consuming. To address these two limitations, we present a shilling group detection method based on a graph convolutional network. First, we model the given dataset as a graph by treating users as nodes and co-rating relations between users as edges. By assigning edge weights and filtering out normal user relations, we obtain a suspicious user relation graph. Second, we use principal component analysis to refine the rating features of users and obtain the user feature matrix. Third, we design a three-layer graph convolutional network model with a neighbor filtering mechanism and perform user classification by combining both the structure and rating features of users. Finally, we detect shilling groups by identifying the target items rated by the attackers according to the user classification results. Extensive experiments show that the classification accuracy and detection performance (F1-measure) of the proposed method reach 98.92% and 99.92% on the Netflix dataset and 93.18% and 92.41% on the Amazon dataset.
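The first step above, building a user relation graph from co-rating behavior, can be sketched as follows. Using the number of co-rated items as the edge weight and filtering below a fixed threshold are illustrative assumptions; the paper's actual weighting and filtering rules may differ:

```python
from collections import defaultdict
from itertools import combinations

def co_rating_graph(ratings, min_weight=2):
    """ratings: iterable of (user, item) pairs.
    Returns edges between users weighted by how many items they
    co-rated, keeping only edges at or above min_weight
    (weak edges stand in for 'normal' relations to filter out)."""
    raters = defaultdict(set)
    for user, item in ratings:
        raters[item].add(user)
    weight = defaultdict(int)
    for users in raters.values():
        for u, v in combinations(sorted(users), 2):
            weight[(u, v)] += 1
    return {edge: w for edge, w in weight.items() if w >= min_weight}
```

The surviving edges form the suspicious user relation graph on which a downstream classifier (a GCN in the paper) operates.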

8.
Blogging has become an emerging medium for people to express themselves. However, the presence of spam blogs (also known as splogs) may reduce the value of blogs and blog search engines. Hence, splog detection has recently attracted much attention from researchers. Most existing works on splog detection identify splogs using their content/link features and target spam filters that protect blog search engines' indexes from spam. In this paper, we propose a splog detection framework that monitors online search results. The novelty of our splog detection is that it capitalizes on the results returned by search engines. The proposed method is therefore particularly useful for detecting those splogs that have successfully slipped through the spam filters and are still actively generating spam posts. More specifically, our method monitors the top-ranked results of a sequence of temporally-ordered queries and detects splogs based on blogs' temporal behavior. The temporal behavior of a blog is maintained in a blog profile. Given blog profiles, splog detection functions have been proposed and evaluated using real data collected from a popular blog search engine. Our experiments have demonstrated that splogs can be detected with high accuracy. The proposed method can be implemented on top of any existing blog search engine without modifying the latter.

9.
10.
The polarity shift problem is a major factor affecting the classification performance of machine-learning-based sentiment analysis systems. In this paper, we propose a three-stage cascade model to address the polarity shift problem in the context of document-level sentiment classification. We first split each document into a set of subsentences and build a hybrid model that employs rules and statistical methods to detect explicit and implicit polarity shifts, respectively. Second, we propose a polarity shift elimination method to remove polarity shifts in negations. Finally, we train base classifiers on training subsets divided by different types of polarity shifts and use a weighted combination of the component classifiers for sentiment classification. The results of a range of experiments illustrate that our approach significantly outperforms several alternative methods for polarity shift detection and elimination.
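The first stage, splitting documents into subsentences and flagging explicit polarity shifts with rules, might look like the following sketch. The cue-word lists and the punctuation-based splitting rule are illustrative stand-ins, not the paper's actual rules:

```python
import re

# Illustrative cue lists, not the paper's rule set.
NEGATIONS = {"not", "no", "never", "hardly"}
CONTRAST = {"but", "however", "although"}

def detect_polarity_shifts(document):
    """Split a document into subsentences and flag those containing
    explicit negation or contrast cues."""
    subsentences = [s.strip() for s in re.split(r"[.!?,;]", document) if s.strip()]
    flagged = []
    for s in subsentences:
        # normalize contractions like "don't" before tokenizing
        tokens = set(re.findall(r"\w+", s.lower().replace("n't", " not")))
        if tokens & (NEGATIONS | CONTRAST):
            flagged.append(s)
    return subsentences, flagged
```

Flagged subsentences would then be routed to the elimination stage, while implicit shifts need the statistical detector the abstract mentions.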

11.
Socially similar social media users can be defined as users whose frequently visited locations in their social media histories are similar. Discovering socially similar users is important for several applications, such as community detection, friendship analysis, location recommendation, urban planning, and anomalous user and behavior detection. It is also challenging, due to dataset size and dimensionality, the spam behaviors of social media users, the spatial and temporal aspects of social media datasets, and location sparseness in those datasets. In the literature, several studies have been conducted to discover similar users from social media datasets using spatial and temporal information. However, most of these studies rely on trajectory pattern mining methods or take into account the semantic information of social media datasets; only a limited number focus on discovering similar users based on their social media location histories. In this study, to discover socially similar users, the frequently visited or socially important locations of social media users are taken into account instead of all locations that users visited. A new interest measure, based on the Levenshtein distance, is proposed to quantify user similarity from socially important locations, and two algorithms are developed using the proposed method and interest measure. The algorithms were experimentally evaluated on a real-life Twitter dataset. The results show that the proposed algorithms can successfully discover similar social media users based on their socially important locations.
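A Levenshtein-based similarity over users' socially important location sequences could be sketched as follows; the normalization into a [0, 1] score is an illustrative assumption, not the paper's exact interest measure:

```python
def levenshtein(a, b):
    """Edit distance between two sequences, e.g., ordered lists of a
    user's socially important location IDs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def location_similarity(locs_a, locs_b):
    """Normalize edit distance into a [0, 1] similarity score."""
    if not locs_a and not locs_b:
        return 1.0
    return 1.0 - levenshtein(locs_a, locs_b) / max(len(locs_a), len(locs_b))
```

Because the distance operates on ordered sequences, two users who visit the same important places in the same order score higher than users who merely share a set of places.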

12.
This study addresses the usage of different features to complement synset-based and bag-of-words representations of texts when using classical ML approaches for spam filtering (Ferrara, 2019). Despite the existence of a large number of complementary features, in order to improve the applicability of this study, we have selected only those that can be computed regardless of the communication channel used to distribute content. Feature evaluation has been performed using content distributed through different channels (social networks and email) and classifiers (AdaBoost, Flexible Bayes, Naïve Bayes, Random Forests, and SVMs). The results reveal the usefulness of detecting certain non-textual entities (such as URLs) in the addressed distribution channels. Moreover, we also found that compression properties and/or information regarding the probability of correctly guessing the language of target texts can be successfully used to improve classification in a wide range of situations. Finally, we also detected features that are influenced by the specific fashions and habits of users of certain Internet services (e.g., the existence of words written in capital letters) and that are not useful for spam filtering.

13.
The demand for transparency and fairness in AI-based decision-making systems is constantly growing. Organisations need to be assured that their applications based on these technologies behave fairly, without introducing negative social implications in relation to sensitive attributes such as gender or race. Since the notion of fairness is context-dependent and not uniquely defined, studies in the literature have proposed various formalisations. In this work, we propose a novel, flexible, discrimination-aware decision tree that allows the user to employ different fairness criteria depending on the application domain. Our approach enhances decision-tree classifiers to provide transparent and fair rules to end users.

14.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning from examples, an approach generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are plentiful and easily collected, labeled documents are difficult to generate because labeling must be done by human annotators. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category, and then automatically learns a text classifier using bootstrapping and feature projection techniques. Experimental results show that the proposed method achieves reasonably useful performance compared with a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.

15.
Learning latent representations for users and points of interest (POIs) is an important task in location-based social networks (LBSNs), one that can largely benefit multiple location-based services, such as POI recommendation and social link prediction. Many contextual factors, such as geographical influence, user social relationships, and temporal information, are available in LBSNs and are useful for this task. However, incorporating all these contextual factors into user and POI representation learning remains challenging due to their heterogeneous nature. Although they deliver encouraging performance on POI recommendation and social link prediction, most existing representation learning methods for LBSNs incorporate only one or two of these contextual factors. In this paper, we propose a novel joint representation learning framework for users and POIs in LBSNs, named UP2VEC. In UP2VEC, we present a heterogeneous LBSN graph that incorporates all the aforementioned factors. Specifically, the transition probabilities between nodes inside the heterogeneous graph are derived by jointly considering these contextual factors. The latent representations of users and POIs are then learnt by matching the topological structure of the heterogeneous graph. To evaluate the effectiveness of UP2VEC, a series of experiments are conducted on two real-world datasets (Foursquare and Gowalla) for POI recommendation and social link prediction. Experimental results demonstrate that UP2VEC significantly outperforms existing state-of-the-art alternatives. A further experiment shows the superiority of UP2VEC in handling the cold-start problem for POI recommendation.
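Deriving transition probabilities by jointly considering contextual factors can be illustrated with a simple weighted combination. The factor names ('geo', 'social') and the mixing parameter alpha are hypothetical, not UP2VEC's actual formulation:

```python
def transition_probs(edges, alpha=0.5):
    """edges: dict mapping each neighbor to its per-factor weights,
    e.g., {'poi_2': {'geo': 0.8, 'social': 0.1}}.
    Combine the factors linearly, then normalize into transition
    probabilities for a random walk from the current node."""
    combined = {v: alpha * f.get('geo', 0.0) + (1 - alpha) * f.get('social', 0.0)
                for v, f in edges.items()}
    total = sum(combined.values()) or 1.0
    return {v: w / total for v, w in combined.items()}
```

Walks sampled from such probabilities bias the learned embeddings toward neighbors that are close geographically, socially, or both, depending on the mix.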

16.
Since meta-paths have the innate ability to capture rich structural and semantic information, meta-path-based recommendations have gained tremendous attention in recent years. However, how should these multi-dimensional meta-paths be combined? How can their dynamic characteristics be characterized? How can their priority and importance be learned automatically to capture users' diverse and personalized preferences at the user-level granularity? These issues are pivotal yet challenging for improving both the performance and the interpretability of recommendations. To address these challenges, we propose a personalized recommendation method via Multi-Dimensional Meta-Paths Temporal Graph Probabilistic Spreading (MD-MP-TGPS). Specifically, we first construct temporal multi-dimensional graphs with full consideration of the interest drift of users, the obsolescence and popularity of items, and the dynamic updating of interaction behavior data. We then propose a dimension-free temporal graph probabilistic spreading framework via multi-dimensional meta-paths. Moreover, to automatically learn the priority and importance of these multi-dimensional meta-paths at the user-level granularity, we propose two boosting strategies for personalized recommendation. Finally, we conduct comprehensive experiments on two real-world datasets; the results show that the proposed MD-MP-TGPS method outperforms the compared state-of-the-art methods on such performance indicators as precision, recall, F1-score, Hamming distance, intra-list diversity, and popularity, covering accuracy, diversity, and novelty.

17.
We propose a topic-dependent attention model for sentiment classification and topic extraction. Our model assumes that a global topic embedding is shared across documents and employs an attention mechanism to derive local topic embeddings for words and sentences. These are subsequently incorporated into a modified Gated Recurrent Unit (GRU) for sentiment classification and for the extraction of topics bearing different sentiment polarities. Those topics emerge from the words’ local topic embeddings learned by the internal attention of the GRU cells in the context of a multi-task learning framework. In this paper, we present the hierarchical architecture, the new GRU unit, and experiments conducted on user reviews, which demonstrate classification performance on a par with state-of-the-art methodologies for sentiment classification and topic coherence that outperforms current approaches for supervised topic extraction. In addition, our model is able to extract coherent aspect-sentiment clusters despite using no aspect-level annotations for training.

18.
The research field of crisis informatics examines, amongst other things, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos) from the public. However, the vast amount of data generated during large-scale incidents can lead to information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant ones, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process, and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis, and relevance classification; (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies; (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision, and 80.4%/87.5% recall, with fast training times given feature subset selection, on the European floods/BASF SE incident datasets); and (4) an approach and preliminary evaluation for relevance classification including active, incremental, and online learning to reduce the amount of labeled data required and to correct misclassifications of the algorithm via feedback classification. Using the latter approach, we achieved a well-performing classifier on the European floods dataset while requiring only a quarter of the labeled data needed by the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, a substantial improvement could still be observed.

19.
A news article’s online audience provides useful insights into the article’s identity. However, fake news classifiers using such information risk relying on profiling. In response to the rising demand for ethical AI, we present a profiling-avoiding algorithm that leverages Twitter users during model optimisation while excluding them when an article’s veracity is evaluated. For this, we take inspiration from the social sciences and introduce two objective functions that maximise the correlation between an article and its spreaders, and among those spreaders. We applied our profiling-avoiding algorithm to three popular neural classifiers and obtained results on fake news data discussing a variety of news topics. The positive impact on prediction performance demonstrates the soundness of the proposed objective functions for integrating social context into text-based classifiers. Moreover, statistical visualisation and dimension reduction techniques show that the user-inspired classifiers better discriminate between unseen fake and true news in their latent spaces. Our study serves as a stepping stone towards resolving the underexplored issue of profiling-dependent decision-making in user-informed fake news detection.

20.
Analyzing and extracting insights from user-generated data has become a topic of interest among businesses and research groups because such data contains valuable information, e.g., consumers’ opinions, ratings, and recommendations of products and services. However, the true value of social media data is rarely discovered due to information overload. The existing literature on analyzing online hotel reviews mainly focuses on a single data resource, lexicon, and analysis method, and rarely provides marketing insights and decision-making information to improve businesses’ service and product quality. We propose an integrated framework which includes a data crawler, data preprocessing, sentiment-sensitive tree construction, convolution tree kernel classification, aspect extraction and category detection, and visual analytics to gain insights into hotel ratings and reviews. The empirical findings show that our proposed approach outperforms baseline algorithms as well as well-known sentiment classification methods, achieving high precision (0.95) and recall (0.96). The visual analytics results reveal that business travelers tend to give lower ratings, while couples tend to give higher ratings. In general, users tend to rate lowest in July and highest in December. Business travelers more frequently use negative keywords, such as “rude,” “terrible,” “horrible,” “broken,” and “dirty,” to express their dissatisfaction with their hotel stays in July.


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号