An RSS-Level Web Page Main Content Extraction Method and System

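Cite this article:	Zhang Yan. An RSS-level web page main content extraction method and system [J]. Library and Information Service (图书情报工作), 2010, 54(14): 107-130.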

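Author:	Zhang Yan

Affiliation:	Library, Nanjing University of Information Science & Technology

Funding:	Supported by the Nanjing University of Information Science & Technology Research Fund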

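Abstract:	Proposes an RSS-level method and system for extracting the main content of web pages. A small number of entries from an RSS feed are used to train a main content template, and that template is then applied to extract the main content of every web page under the feed. The method can extract a page's title, body text, category, and other fields separately. It also includes a self-adaptation mechanism that detects template changes in real time. Experimental results show that the method and system achieve high recall and precision.
Keywords:	web page main content extraction; RSS; template self-adaptation mechanism
Received:	2010-03-02
Revised:	2010-04-22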
A RSS Level Web Page Main Content Extraction Method and System

Zhang Yan. A RSS Level Web Page Main Content Extraction Method and System [J]. Library and Information Service, 2010, 54(14): 107-130.

Authors:	Zhang Yan

Institution:	Nanjing University of Information Science & Technology Library

Abstract:	This paper proposes an RSS-level method and system for extracting the main content of web pages. The method uses the meta information of a small number of entries in an RSS feed to train a main content template and then applies that template to extract the main content of every web page in the feed. It can extract the title, body, and category information separately. It also has a self-adaptation mechanism that detects template changes in real time. Experimental results show that the method and system achieve high recall and precision.

Keywords:	web page main content extraction; RSS; template self-adaptation mechanism
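
The abstract describes the workflow only at a high level: train a template from a few feed entries, apply it to every page in the feed, and notice when the template stops matching. The following is a minimal Python sketch of that idea, not the paper's algorithm. Everything in it is an assumption made for illustration: the template is represented as an element path such as `html/body/div.post`, the training signal is text similarity between each element and the RSS entry description, a missing path is treated as a template change, and the names `PathTextCollector`, `collect`, `best_path`, `train_template`, and `extract` are hypothetical. It also assumes well-formed HTML.

```python
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect the text found under each element path (assumes well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.stack = []         # path components of the currently open elements
        self.text_by_path = {}  # element path -> list of text chunks beneath it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every open ancestor so containers see all their text.
        for depth in range(1, len(self.stack) + 1):
            path = "/".join(self.stack[:depth])
            self.text_by_path.setdefault(path, []).append(text)


def collect(html):
    """Map each element path in the page to its concatenated text."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(chunks) for path, chunks in parser.text_by_path.items()}


def best_path(html, reference_text):
    """Return the path whose text is most similar to the RSS entry text."""
    texts = collect(html)
    return max(
        texts,
        key=lambda p: SequenceMatcher(None, texts[p], reference_text, autojunk=False).ratio(),
    )


def train_template(samples):
    """samples: (page_html, entry_description) pairs taken from a few feed entries."""
    votes = Counter(best_path(html, description) for html, description in samples)
    return votes.most_common(1)[0][0]


def extract(html, template):
    """Apply the template; None suggests the layout changed and retraining is needed."""
    return collect(html).get(template)


PAGE = ("<html><body><div class='nav'>Home News About</div>"
        "<div class='post'><p>{}</p></div></body></html>")

template = train_template([
    (PAGE.format("Alpha story with full text."), "Alpha story"),
    (PAGE.format("Beta report with full text."), "Beta report"),
])
print(template)
print(extract(PAGE.format("Gamma update with full text."), template))
```

In this sketch a `None` result from `extract` stands in for the self-adaptation trigger: a caller could respond by running `train_template` again on fresh entries from the feed. Extracting title and category separately, as the abstract mentions, would need one template per field, which the sketch does not attempt.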
This article is indexed in Wanfang Data and other databases.