Similar Articles
 Found 20 similar articles (search time: 31 ms)
1.
In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different ways to take advantage of semantic information to address problems such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single feature “anti-impotence drug”, “drug” or “chemical substance”, depending on a generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA), and particularly of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of Naïve Bayes classifiers.
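The hypernym-based generalization described in this abstract can be sketched as follows. The toy taxonomy below is a stand-in for the BabelNet ontological dictionary, and the function names are illustrative, not from the paper; choosing the generalization level per synset is what the paper delegates to NSGA-II.

```python
# Sketch of hypernym-based feature generalization. A tiny hand-made
# taxonomy stands in for BabelNet; chains list hypernyms from most
# specific (level 1) to most general.
HYPERNYM_CHAINS = {
    "viagra": ["anti-impotence drug", "drug", "chemical substance"],
    "cialis": ["anti-impotence drug", "drug", "chemical substance"],
    "rolex":  ["watch", "timepiece", "instrument"],
}

def generalize(token, levels):
    """Replace a token by its hypernym `levels` steps up, if available."""
    chain = HYPERNYM_CHAINS.get(token.lower())
    if chain is None or levels == 0:
        return token  # no taxonomic knowledge: keep the token as-is
    return chain[min(levels, len(chain)) - 1]

def reduce_features(tokens, levels):
    """Distinct tokens may collapse into one shared, more general feature."""
    return sorted({generalize(t, levels) for t in tokens})
```

At level 1, "viagra" and "cialis" collapse into the single feature "anti-impotence drug", shrinking the feature space without discarding the underlying semantics.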

2.
李爽 《科技通报》2012,28(4):180-181
To address the low spam filtering rate of the traditional Naïve Bayes algorithm, this paper proposes a spam filtering technique based on minimum-risk Bayesian networks. The proposed minimum-risk Bayes approach reduces the risk of misclassifying legitimate e-mail as spam. Experiments show that, compared with the traditional algorithm, the proposed method achieves substantially better filtering performance.
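The minimum-risk Bayes decision rule underlying this kind of filter can be sketched as follows; the loss values are illustrative, and the asymmetric loss (misjudging ham as spam costs more) is what reduces the risk the abstract mentions.

```python
def min_risk_decision(p_spam, loss_ham_as_spam=10.0, loss_spam_as_ham=1.0):
    """Choose the label with the smaller expected loss, not the larger
    posterior. Loss values are illustrative: flagging a legitimate mail
    as spam is assumed 10x as costly as letting a spam through."""
    p_ham = 1.0 - p_spam
    risk_of_saying_spam = loss_ham_as_spam * p_ham
    risk_of_saying_ham = loss_spam_as_ham * p_spam
    return "spam" if risk_of_saying_spam < risk_of_saying_ham else "ham"
```

With these losses, a mail with P(spam) = 0.8 is still delivered as ham (expected loss 0.8 vs. 2.0), whereas a plain maximum-posterior Bayes classifier would discard it as spam.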

3.
4.
Blogging has been an emerging medium for people to express themselves. However, the presence of spam blogs (also known as splogs) may reduce the value of blogs and blog search engines. Hence, splog detection has recently attracted much attention from researchers. Most existing works on splog detection identify splogs using their content/link features and aim at spam filters that protect blog search engines' indexes from spam. In this paper, we propose a splog detection framework that monitors on-line search results. The novelty of our splog detection is that it capitalizes on the results returned by search engines. The proposed method is therefore particularly useful in detecting those splogs that have successfully slipped through the spam filters and are still actively generating spam-posts. More specifically, our method monitors the top-ranked results of a sequence of temporally-ordered queries and detects splogs based on blogs' temporal behavior. The temporal behavior of a blog is maintained in a blog profile. Given blog profiles, splog detecting functions are proposed and evaluated using real data collected from a popular blog search engine. Our experiments demonstrate that splogs can be detected with high accuracy. The proposed method can be implemented on top of any existing blog search engine without intrusion into the latter.
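A minimal sketch of the profile-and-flag idea: build a per-blog profile from the top-ranked results of temporally-ordered queries and flag blogs whose temporal behavior stands out. The counting scheme and threshold below are illustrative assumptions, not the paper's actual splog detecting functions.

```python
from collections import Counter

def build_profiles(query_results):
    """query_results: one list per temporally-ordered query, each holding
    the blog ids that appeared in that query's top-ranked results."""
    profile = Counter()
    for top_blogs in query_results:
        profile.update(set(top_blogs))  # count each blog once per query
    return profile

def flag_splogs(profile, n_queries, min_rate=0.8):
    """Flag blogs appearing in the top results of almost every query --
    one simple temporal-behavior signal (threshold is illustrative)."""
    return sorted(b for b, c in profile.items() if c / n_queries >= min_rate)
```

A legitimate blog is typically relevant to only some queries; a splog stuffed with popular keywords keeps resurfacing across the whole query sequence.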

5.
The demand to detect opinionated spam, using opinion mining applications to prevent its damaging effects on e-commerce reputations, is on the rise in many business sectors globally. The spam detection techniques in use nowadays consider only one or two types of spam entities, such as review, reviewer, group of reviewers, and product. Besides, they use a limited number of features related to the behaviour, content and relations of entities, which reduces detection accuracy. Accordingly, these techniques mostly exploit synthetic datasets to analyse their models and cannot be applied in real-world environments. As such, a novel graph-based model called “Multi-iterative Graph-based opinion Spam Detection” (MGSD), in which all the various types of entities are considered simultaneously within a unified structure, is proposed. Using this approach, the model reveals both implicit (i.e., similar entities') and explicit (i.e., different entities') relationships. The MGSD model is able to evaluate the ‘spamicity’ effects of entities more efficiently given that it applies a novel multi-iterative algorithm which considers different sets of factors to update the spamicity score of entities. To enhance the accuracy of the MGSD detection model, a large number of existing weighted features, along with novel proposed features from different categories, were selected using a combination of feature fusion techniques and machine learning (ML) algorithms. The MGSD model can also be generalised and applied to various opinionated documents, as it employs domain-independent features. Our feature selection and feature fusion techniques yielded a remarkable improvement in detecting spam.
The findings of this study showed that MGSD could improve the accuracy of state-of-the-art ML and graph-based techniques by around 5.6% and 4.8%, respectively, achieving an accuracy of 93% for spam detection on our synthetic crowdsourced dataset and 95.3% on Ott's crowdsourced dataset.

6.
One of the most relevant problems affecting the efficient use of e-mail to communicate worldwide is the spam phenomenon. Spamming involves flooding the Internet with undesired messages aimed at promoting illegal or low-value products and services. Beyond the existence of different well-known machine learning techniques, collaborative schemes and other complementary approaches, some popular anti-spam frameworks such as SpamAssassin or Wirebrush4SPAM enable the use of regular expressions to effectively improve filter performance. In this work, we provide a review of existing proposals to automatically generate fully functional regular expressions from any input dataset combining spam and ham messages. Due to the configuration difficulties and low performance of the analysed schemes, we introduce DiscoverRegex, a novel automatic spam pattern-finding tool. Patterns generated by DiscoverRegex outperform those created by existing approaches (being able to avoid FP errors) whilst minimising the computational resources required for their proper operation. DiscoverRegex source code is publicly available at https://github.com/sing-group/DiscoverRegex.

7.
Spam in recent years has pervaded all forms of digital communication. The increase in the user base of social platforms like Facebook, Twitter, YouTube, etc., has opened new avenues for spammers. The liberty to contribute content freely has encouraged spammers to exploit social platforms for their benefit. E-mail and web search engines, being the early victims of spam, have attracted serious attention from information scientists for quite some time, and a substantial amount of research has been directed at combating spam on these two platforms. Social networks, being quite different in nature from the earlier two, suffer from different kinds of spam, and spam-fighting techniques from those domains seldom work. Moreover, due to the continuous and rapid evolution of social media, spam itself evolves very fast, posing a great challenge to the community. Despite the area being relatively new, there have been a number of attempts at social spam detection in the recent past, and many more are certain to come in the near future. This paper surveys recent developments in social spam detection and mitigation, its theoretical models and applications, along with their qualitative comparison. We present the state of the art and attempt to identify the challenges to be addressed, as the nature and content of spam are bound to get more complicated.

8.
Building on an introduction to and analysis of Bayesian theory, this paper presents the Bayesian algorithm and the Naïve Bayes classifier, and describes their application to anti-spam e-mail filtering.
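A minimal multinomial Naïve Bayes spam classifier with Laplace smoothing, the standard construction this abstract refers to (function names and the toy corpus in the usage below are, of course, illustrative):

```python
import math
from collections import Counter

def nb_train(docs):
    """docs: list of (token_list, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def nb_classify(tokens, word_counts, doc_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(token|label)."""
    total_docs = sum(doc_counts.values())
    best_label, best_logp = None, -math.inf
    for label in word_counts:
        logp = math.log(doc_counts[label] / total_docs)        # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        for t in tokens:
            logp += math.log((word_counts[label][t] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label
```

Working in log-space avoids floating-point underflow, and the +1 smoothing keeps unseen words from zeroing out a class's probability.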

9.
In this paper, the problem of asynchronous H∞ filtering for singular Markov jump systems with redundant channels under an event-triggered scheme is studied. In order to save the resources of the bandwidth-limited network and improve the quality of data transmission, we utilize an event-triggered scheme and employ redundant channels. The redundant channels are modeled as two mutually independent Bernoulli-distributed random variables. To formulate the asynchronization phenomenon between the system modes and the filter modes, a hidden Markov model is adopted, so that the filtering error system becomes a singular hidden Markov jump system. A criterion guaranteeing that the filtering error system is regular, causal and stochastically stable with a certain H∞ performance is obtained. The co-design of the asynchronous filter and the event-triggered scheme is formulated in terms of a group of feasible linear matrix inequalities. Two examples are given to show the effectiveness of the proposed method.

10.
We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
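One plausible reading of the clusterhead update: the clusterhead is drawn toward a misclassified document, then "leashed" back toward its class centroid. The learning rate, leash strength and the exact form of the update below are assumptions for illustration, not taken from the paper.

```python
def update_clusterhead(clusterhead, centroid, misclassified_doc,
                       lr=0.1, leash=0.5):
    """Competitive-learning step (illustrative parametrization):
    move toward the misclassified document, then pull back toward the
    class centroid to prevent over-fitting ('leashing')."""
    moved = [c + lr * (d - c) for c, d in zip(clusterhead, misclassified_doc)]
    return [m + leash * lr * (ce - m) for m, ce in zip(moved, centroid)]
```

Because the leash term is proportional to the same fixed learning rate, the clusterhead can drift with a changing document distribution but never escapes the neighborhood of its centroid.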

11.
Current spam filtering relies mainly on content-based, IP-address-based and rule-based methods, each of which plays a certain role in filtering spam. However, any single filtering technique judges mail by only one of its attributes, making the filtering decision one-sided. To address this, a mail filtering system based on a jury mechanism was designed. In this system, the result produced by each existing mail filter is not the final filtering verdict but serves as one input to the system's judgment; the final filtering result is then computed from these inputs according to the system's aggregation rules.

12.
Recommendation is an effective marketing tool widely used in the e-commerce business, and can be made based on ratings predicted from the rating data of purchased items. To improve the accuracy of rating prediction, user reviews or product images have been used separately as side information to learn the latent features of users (items). In this study, we developed a hybrid approach to analyze both user sentiments from review texts and user preferences from item images to make item recommendations more personalized for users. The hybrid model consists of two parallel modules to perform a procedure named the multiscale semantic and visual analyses (MSVA). The first module is designated to conduct semantic analysis on review documents in various aspects with word-aware and scale-aware attention mechanisms, while the second module is assigned to extract visual features with block-aware and visual-aware attention mechanisms. The MSVA model was trained, validated and tested using Amazon Product Data containing sampled reviews varying from 492,970 to 1 million records across 22 different domains. Three state-of-the-art recommendation models were used as the baselines for performance comparisons. On average, MSVA reduced the mean squared error (MSE) of predicted ratings by 6.00%, 3.14% and 3.25% compared with the three baselines. It was demonstrated that combining semantic and visual analyses enhanced MSVA's performance across a wide variety of products, and the multiscale scheme used in both the review and visual modules of MSVA made significant contributions to the rating prediction.

13.
Modeling user profiles is a necessary step for most information filtering systems – such as recommender systems – to provide personalized recommendations. However, most of them work with users or items as vectors, applying different types of mathematical operations between them and neglecting sequential or content-based information. Hence, in this paper we propose an adaptive mechanism to obtain user sequences using different sources of information, allowing the generation of hybrid recommendations as a seamless, transparent technique from the system viewpoint. As a proof of concept, we adapt the Longest Common Subsequence (LCS) algorithm as a similarity metric to compare user sequences. In the process of adapting this algorithm to recommendation, we include different parameters to control efficiency by reducing the information used in the algorithm (preference filter), to decide when a neighbor is considered useful enough to be included in the process (confidence filter), to identify whether two interactions are equivalent (δ-matching threshold), and to normalize the length of the LCS in a bounded interval (normalization functions). These parameters can be extended to work with any type of sequential algorithm. We evaluate our approach with several state-of-the-art recommendation algorithms using different evaluation metrics measuring the accuracy, diversity, and novelty of the recommendations, and analyze the impact of the proposed parameters. We have found that our approach offers competitive performance, outperforming content-based, collaborative, and hybrid baselines, and producing positive results when either content- or rating-based information is exploited.
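The LCS similarity at the core of this approach can be sketched with the classic dynamic program; the pluggable `match` predicate plays the role of the δ-matching threshold, and normalizing by the shorter sequence length is one possible normalization function (an illustrative choice, not necessarily the paper's):

```python
def lcs_length(seq_a, seq_b, match=lambda a, b: a == b):
    """Classic O(|a|*|b|) dynamic program. The `match` predicate decides
    when two interactions count as equivalent (delta-matching)."""
    dp = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, a in enumerate(seq_a, 1):
        for j, b in enumerate(seq_b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if match(a, b) else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def lcs_similarity(seq_a, seq_b):
    """One way to normalize the LCS length into [0, 1]."""
    if not seq_a or not seq_b:
        return 0.0
    return lcs_length(seq_a, seq_b) / min(len(seq_a), len(seq_b))
```

Unlike vector-space cosine similarity, this metric is order-sensitive: two users who consumed the same items in a different order get a lower score.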

14.
[Purpose/Significance] This article aims to reveal the thematic structure of the relevant policies and its implications, providing a reference for disseminating the spirit of the policies, for policy formulation and optimization, and for industrial planning and development. [Method/Process] Using content analysis and discourse analysis, we examined ten AI development policy documents issued at the national level and by provinces and municipalities at different levels of AI development in China. [Result/Conclusion] The study finds that the content of AI policies across provinces, municipalities and the national level exhibits both commonality and divergence. Finally, we identify possible shortcomings in the existing policy documents and offer several recommendations, in the hope of informing policy formulation and optimization.

15.
The proliferation of spam has created an urgent technical demand. This article presents a spam filtering system model based on text classification: it first describes the overall workflow of the system, and then elaborates on its key components, including Chinese word segmentation, text feature extraction, and a Winnow linear classifier.
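A minimal sketch of the Winnow linear classifier mentioned as a key component: multiplicative promotion/demotion updates on binary (e.g. bag-of-words) feature vectors. The fixed threshold, promotion factor and number of passes are illustrative defaults.

```python
def winnow_train(examples, n_features, alpha=2.0, passes=10):
    """Classic Winnow: weights start at 1, threshold fixed at n_features.
    examples: list of (binary feature vector, label) with label 1 = spam."""
    w = [1.0] * n_features
    theta = float(n_features)
    for _ in range(passes):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred == 0 and y == 1:      # promotion: missed a spam
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 1 and y == 0:    # demotion: flagged a ham
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta

def winnow_predict(w, theta, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
```

Winnow's mistake bound grows only logarithmically with the number of features, which is why it suits the very high-dimensional bag-of-words vectors produced after word segmentation.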

16.
The problem of social spam detection has traditionally been modeled as a supervised classification problem. Despite the initial success of this detection approach, later analysis of the proposed systems and detection features has shown that, as with email spam, the dynamic and adversarial nature of social spam makes the performance achieved by supervised systems hard to maintain. In this paper, we investigate the possibility of using the output of previously proposed supervised classification systems as a tool for spammer discovery. The hypothesis is that these systems are still highly capable of detecting spammers reliably even when their recall is far from perfect. We then propose to use the output of these classifiers as prior beliefs in a probabilistic graphical model framework. This framework allows beliefs to be propagated to similar social accounts. Basing similarity on a who-connects-to-whom network has been empirically critiqued in recent literature, and we propose here an alternative definition based on a bipartite users-content interaction graph. For evaluation, we build a Markov Random Field on a graph of similar users and compute prior beliefs using a selection of state-of-the-art classifiers. We apply Loopy Belief Propagation to obtain posterior predictions on users. The proposed system is evaluated on a recent Twitter dataset that we collected and manually labeled. Classification results show a significant increase in recall with maintained precision. This validates that formulating the detection problem within an undirected graphical model framework makes it possible to restore the deteriorated performance of previously proposed statistical classifiers and to effectively mitigate the effect of spam evolution.

17.
This paper reports on an approach to the analysis of form (layout and formatting) during genre recognition, recorded using eye tracking. The researchers focused on eight different types of e-mail, such as calls for papers, newsletters and spam, which were chosen to represent different genres. The study involved the collection of oculographic behavior data based on scanpath duration and scanpath length metrics, to highlight the ways in which people view the features of genres. We found that genre analysis based on purpose and form (layout features, etc.) was an effective means of identifying the characteristics of these e-mails.

18.
[Purpose/Significance] This experimental study analyzes how different feature extraction algorithms affect the quality of news text clustering. [Method/Process] Using the Sohu news corpus from the Sogou Lab and a corpus of Australian Broadcasting Corporation news headlines from 2003-2017, we performed cluster analysis on three single features (TF-IDF, Word2vec and Doc2vec) and four combined features (TF-IDF+Word2vec, TF-IDF+Doc2vec, Word2vec+Doc2vec and TF-IDF+Word2vec+Doc2vec) with the K-means, agglomerative and DBSCAN algorithms, evaluating clustering quality with the Purity and NMI metrics. [Result/Conclusion] Among the single features, clustering quality ranks Word2vec > TF-IDF > Doc2vec; among the combined features, TF-IDF+Word2vec performs best. Word2vec is the strongest single feature and the main factor behind the differences among feature combinations; whether combining features improves clustering performance must be judged comprehensively on multiple factors.
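The Purity measure used to evaluate the clusterings can be computed as follows (a standard definition, sketched here for reference):

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Purity: assign each cluster its majority true class, then report
    the fraction of documents that match their cluster's majority class."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(true_labels)
```

Note that Purity trivially reaches 1.0 when every document sits in its own cluster, which is why the study pairs it with NMI, a measure that penalizes over-fragmentation.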

19.
An experiment was conducted to see how relevance feedback could be used to build and adjust profiles to improve the performance of filtering systems. Data was collected during the system interaction of 18 graduate students with SIFTER (Smart Information Filtering Technology for Electronic Resources), a filtering system that ranks incoming information based on users' profiles. The data set came from a collection of 6000 records concerning consumer health. In the first phase of the study, three different modes of profile acquisition were compared. The explicit mode allowed users to directly specify the profile; the implicit mode utilized relevance feedback to create and refine the profile; and the combined mode allowed users to initialize the profile and to continuously refine it using relevance feedback. Filtering performance, measured in terms of Normalized Precision, showed that the three approaches were significantly different (α=0.05 and p=0.012). The explicit mode of profile acquisition consistently produced superior results. Exclusive reliance on relevance feedback in the implicit mode resulted in inferior performance. The low performance obtained by the implicit acquisition mode motivated the second phase of the study, which aimed to clarify the role of context in relevance feedback judgments. An inductive content analysis of think-aloud protocols revealed dimensions that were highly situational, establishing the importance that context plays in relevance feedback assessments. Results suggest the need for better representation of documents, profiles, and relevance feedback mechanisms that incorporate the dimensions identified in this research.
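A Rocchio-style update is one plausible realization of the implicit (relevance-feedback) profile acquisition mode described here; the weights and function name are illustrative, and SIFTER's actual mechanism may differ.

```python
def update_profile(profile, doc, relevant, beta=0.5, gamma=0.25):
    """Rocchio-style relevance feedback: move the profile vector toward
    documents the user judged relevant and away from non-relevant ones.
    beta/gamma are illustrative feedback weights."""
    sign = beta if relevant else -gamma
    return [p + sign * d for p, d in zip(profile, doc)]
```

Under this scheme the profile is only as good as the feedback stream, which is consistent with the study's finding that exclusive reliance on relevance feedback underperformed explicit profile specification.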

20.
P. D’Este, P. Patel 《Research Policy》2007,36(9):1295-1313
This paper examines the different channels through which academic researchers interact with industry and the factors that influence the researchers’ engagement in a variety of interactions. This study is based on a large scale survey of UK academic researchers. The results show that university researchers interact with industry using a wide variety of channels, and engage more frequently in the majority of the channels examined - such as consultancy & contract research, joint research, or training - as compared to patenting or spin-out activities. In explaining the variety and frequency of interactions, we find that individual characteristics of researchers have a stronger impact than the characteristics of their departments or universities. Finally, we argue that by paying greater attention to the broad range of knowledge transfer mechanisms (in addition to patenting and spin-outs), policy initiatives could contribute to building the researchers’ skills necessary to integrate the worlds of scientific research and application.

