
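A Topical Crawler for Online Terrorist Information Based on Regression Analysis
Cite this article:Huang Wei, Zhang Zhancheng, Zhu Bing, Li Yuefeng, Lu Wei. A Topical Crawler for Online Terrorist Information Based on Regression Analysis[J]. Library and Information Service, 2018, 62(4): 121-129.
Authors:Huang Wei  Zhang Zhancheng  Zhu Bing  Li Yuefeng  Lu Wei
Institution:1. School of Economics and Management, Hubei University of Technology, Wuhan 430068; 2. State Grid Wuhan East Lake New Technology Development Zone Power Supply Company, Wuhan 430073
Funding:This paper is one of the research outcomes of two National Natural Science Foundation of China projects: "Multi-kernel methods for real-time active sensing of online public opinion events in the microblog environment" (Grant No. 71303075) and "Unsupervised text classification based on feature ontology learning in the big data environment" (Grant No. 71571064).
Abstract:[Purpose/Significance] Collecting online terrorist information from open-source web information is currently difficult and inefficient. To address this, a regression analysis method is proposed that combines two factors, semantic relevance and web page importance, to improve the collection efficiency of online terrorist information. [Method/Process] By analyzing and comparing the characteristics of topical crawlers, and taking into account the characteristics of online terrorist information, the strengths of the PageRank and TF-IDF algorithms that suit terrorist information collection are identified. Combined with regression analysis, relevance is predicted for the collection strategy, and the prediction results are fed back to adjust the collection process. [Result/Conclusion] The collection of online terrorist information must balance quantity and quality. Building on traditional topical crawler algorithms, a crawler optimization algorithm is proposed for collecting terrorist information from open-source web information, which can improve collection efficiency.

Keywords:topical crawler  regression analysis  online counter-terrorism  semantic similarity  
Received:2017-08-21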

A Network Counter-terrorism Information Crawler Based on the Regression Analysis
Huang Wei, Zhang Zhancheng, Zhu Bing, Li Yuefeng, Lu Wei. A Network Counter-terrorism Information Crawler Based on the Regression Analysis[J]. Library and Information Service, 2018, 62(4): 121-129.
Authors:Huang Wei  Zhang Zhancheng  Zhu Bing  Li Yuefeng  Lu Wei
Institution:1. School of Economics and Management, Hubei University of Technology, Wuhan 430068; 2. Wuhan East Lake High-tech Development Zone Power Company, State Grid Corporation of China, Wuhan 430073
Abstract:[Purpose/significance] Collecting terrorist information from open-source network information is difficult and inefficient. To address this, a method based on regression analysis is proposed that combines semantic relevance and web page importance to improve the collection efficiency of network terrorist information. [Method/process] By analyzing and comparing the characteristics of theme crawlers, and taking into account the characteristics of network terrorist information, the strengths of the PageRank algorithm and the TF-IDF algorithm that suit terrorist information collection were identified. Combined with regression analysis, relevance was predicted for the collection strategy, and the prediction results were fed back to adjust the collection process. [Result/conclusion] The collection of network terrorist information should take both quantity and quality into account. Building on traditional theme crawler algorithms, this paper proposes a crawler optimization algorithm for collecting terrorist information from open-source network information, which can improve collection efficiency.
Keywords:theme crawler  regression analysis  network anti-terrorism  semantic similarity  
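
The abstract names the ingredients (TF-IDF for semantic relevance, PageRank for page importance, and a regression that predicts relevance to steer the crawl) but does not give the regression form, the features, or the training data. The following is only a minimal sketch under stated assumptions: a linear least-squares model over two features (TF-IDF cosine similarity to a topic description and a PageRank score), with hypothetical function names, toy data, and a best-first frontier. It is not the authors' algorithm.

```python
import heapq
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one TF-IDF dict per tokenized document (smoothed IDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p in pages:
            outs = links.get(p) or list(pages)  # dangling page: spread evenly
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank


def fit_least_squares(rows, targets):
    """Ordinary least squares via normal equations; returns [bias, w1, ...].

    Assumes the feature matrix has full column rank (no zero pivots).
    """
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(x[i] * x[j] for x in xs) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(xs, targets))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, features):
    """Predicted relevance for one (similarity, pagerank) feature tuple."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def rank_frontier(candidates, weights):
    """Return candidate URLs ordered by predicted relevance, highest first."""
    heap = [(-predict(weights, feats), url) for url, feats in candidates.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    # Toy data only: a topic description and the link context of two URLs.
    topic, *contexts = tf_idf_vectors([
        ["crawler", "topic", "relevance"],
        ["topic", "crawler", "news"],
        ["sports", "news", "scores"],
    ])
    rank = pagerank({"seed": ["u1", "u2"], "u1": ["seed"], "u2": []})
    # Hypothetical labeled examples: (similarity, pagerank) -> relevance.
    weights = fit_least_squares(
        [(0, 0), (1, 0), (0, 1), (1, 1)], [0.1, 0.7, 0.4, 1.0])
    candidates = {url: (cosine(topic, vec), rank[url])
                  for url, vec in zip(["u1", "u2"], contexts)}
    print(rank_frontier(candidates, weights))
```

In a crawler loop, pages fetched and judged for relevance could be appended to the training rows and the model refit periodically, which is one way to read the feedback adjustment the abstract describes.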