
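A Distributed Scheme for Collecting Graduate School Information Based on Scrapy-Splash
Cite this article: Guo Jianlin, Liu Pengcheng. A Distributed Scheme for Collecting Graduate School Information Based on Scrapy-Splash[J]. Introduction of Educational Technology (教育技术导刊), 2009, 8(9): 180-185.
Authors: Guo Jianlin, Liu Pengcheng
Affiliation: School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
Funding: Natural Science Foundation of Zhejiang Province (LQ17E050013); Higher Education Teaching Reform Project of the China National Textile and Apparel Council (2017BKJGLX293)
Abstract: The Internet holds a large volume of information about the graduate entrance examination, and collecting, analyzing, and filtering it effectively is essential to subsequent data mining and data analysis. After analyzing Scrapy, and to address the Scrapy framework's inability to download AJAX dynamic pages, this paper proposes an information collection scheme that uses the Scrapy-Splash module so that Scrapy can also crawl AJAX data. The Request construction method and the Response follow-up method of the Scrapy framework are overridden so that the Scrapy Engine can send rendering requests to Splash, receive the SplashResponse objects returned by rendering, and schedule them. A distributed web crawler system is designed with the Scrapy-Redis framework to obtain data from the China Graduate Entrance Examination Website efficiently and stably. Test results show that the collected data are timely and reliable.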

Keywords: web crawler; Scrapy; Splash; dynamic web page crawling
Received: 2020-01-07

Distributed Collection Scheme of Graduate School Information Based on Scrapy-Splash
Guo Jianlin, Liu Pengcheng. Distributed Collection Scheme of Graduate School Information Based on Scrapy-Splash[J]. Introduction of Educational Technology, 2009, 8(9): 180-185.
Authors: Guo Jianlin, Liu Pengcheng
Institution: School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, Zhejiang 310018, China
Abstract: A large amount of information about the graduate entrance examination is published on the Internet. Collecting, analyzing, and screening this information effectively plays a significant role in subsequent data mining and data analysis. Building on an analysis of Scrapy, and addressing the framework's inability to download dynamic pages that use AJAX (Asynchronous JavaScript and XML), this paper proposes a scheme that introduces the Scrapy-Splash module so that Scrapy can crawl AJAX data. By overriding the Scrapy framework's Request construction method and Response follow-up method, the Scrapy Engine can send rendering requests to Splash and then receive and schedule the returned SplashResponse objects. Based on the Scrapy-Redis framework, a distributed web crawler system is designed to obtain data from the China Graduate Entrance Examination Website efficiently and stably. Test results show that the obtained data are timely and reliable.
Keywords: web crawler; Scrapy; Splash; dynamic web page crawling
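
The abstract combines two Scrapy extensions: Scrapy-Splash for rendering AJAX pages and Scrapy-Redis for distributing the crawl. As a rough illustration only, a project's settings.py for such a setup might look like the sketch below. It is not taken from the paper: the host names, ports, and priority numbers are assumptions based on the defaults commonly shown in the two projects' READMEs, and the choice of duplicate filter is an assumption as well.

```python
# settings.py sketch for a Scrapy project using Scrapy-Splash and Scrapy-Redis.
# Assumption: Splash and Redis run locally on their default ports.
SPLASH_URL = "http://localhost:8050"   # assumed address of the Splash rendering service
REDIS_URL = "redis://localhost:6379"   # assumed address of the shared Redis instance

# Downloader middlewares that route requests through Splash for rendering.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments in the queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Scrapy-Redis scheduler: all worker nodes pull requests from one Redis queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True  # keep the queue in Redis between runs

# Assumption: the Redis-backed filter is used so that deduplication is shared
# across nodes. Scrapy-Splash ships its own "scrapy_splash.SplashAwareDupeFilter";
# only one DUPEFILTER_CLASS can be active, so a real deployment may need a
# custom class that combines both behaviours (not shown here).
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```

With settings of this kind, a spider would typically yield `scrapy_splash.SplashRequest` objects instead of plain `scrapy.Request` objects, and each worker node would run the same spider against the shared Redis queue.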
This article is indexed in CNKI, VIP (维普), and other databases.