A General Approach to Extracting Topical Information in HTML Pages*

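Cite this article:	Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai. A General Approach to Extracting Topical Information in HTML Pages*[J]. New Technology of Library and Information Service (现代图书情报技术), 2007, 2(1): 40-43.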

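Authors:	Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai

Affiliation:	Chinese Information Processing Research Center, Beijing Information Science and Technology University, Beijing 100101

Funding:	National Natural Science Foundation of China; Beijing Municipal Education Commission Science and Technology Development Program; Beijing Municipal Science and Technology Commission research project; Beijing Municipal Education Commission research project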

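Abstract:	Following the DOM specification, HTML pages are represented as tree structures. The paper studies and analyzes the extraction of "topic" information from HTML pages built on different templates and proposes a new method for judging the topical relevance of nodes. Based on this method, the topical content to be extracted is identified and unrelated content is deleted; the output is an HTML document containing only the topic information.
Keywords:	information extraction; blocking; relevance
Received:	2006-10-09
Revised:	2006-10-09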
A General Approach to Extracting Topical Information in HTML Pages

Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai. A General Approach to Extracting Topical Information in HTML Pages[J]. New Technology of Library and Information Service, 2007, 2(1): 40-43.

Authors:	Xu Wen, Du Yuncheng, Li Yuqin, Shi Shuicai

Abstract:	This paper studies how to extract topical content from Web pages built on different kinds of templates and introduces a new DOM-based extraction method. The approach transforms HTML documents into DOM trees, extracts the topical content, and deletes topic-unrelated content. The output is an HTML document that contains only the topic information.

Keywords:	DOM
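
The pipeline the abstract describes (parse HTML into a tree, judge each node's topical relevance, delete unrelated nodes, output the remaining HTML) can be illustrated with a minimal sketch. The abstract does not state the paper's relevance criterion, so the sketch substitutes a simple link-density heuristic as a stand-in: a block whose text is mostly anchor text is treated as navigation and removed. All names here (`Node`, `TreeBuilder`, `prune`, `extract_topic_html`, the `max_link_ratio` threshold, and the `BLOCKS` tag set) are hypothetical and are not taken from the paper.

```python
from html import escape
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}   # tags with no closing tag
SKIP = {"script", "style"}                            # never topical; always dropped
BLOCKS = {"div", "td", "ul", "table", "p"}            # assumed candidate blocks


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, list(attrs), parent
        self.children = []  # Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree; assumes reasonably well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def text_stats(node, in_link=False):
    """Return (total_chars, chars_inside_links) for a subtree."""
    total = linked = 0
    for child in node.children:
        if isinstance(child, str):
            n = len(child.strip())
            total += n
            linked += n if in_link else 0
        elif child.tag not in SKIP:
            t, l = text_stats(child, in_link or child.tag == "a")
            total, linked = total + t, linked + l
    return total, linked


def prune(node, max_link_ratio=0.5):
    """Delete blocks judged unrelated to the topic (stand-in heuristic)."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in SKIP:
            continue
        if child.tag in BLOCKS:
            total, linked = text_stats(child)
            # Assumption: empty or link-dominated blocks are not topical.
            if total == 0 or linked / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def to_html(node):
    """Serialize the pruned tree back to an HTML string."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(to_html(c) for c in node.children)
    if node.tag == "root":
        return inner
    attrs = "".join(f' {k}="{escape(v or "")}"' for k, v in node.attrs)
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def extract_topic_html(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return to_html(builder.root)
```

Calling `extract_topic_html(page)` on a page with a link-only navigation `div` and a text-bearing content `div` keeps the latter and drops the former. A single ratio threshold is crude: it also discards image-only blocks and can misjudge wrapper elements that mix navigation with content, which is the kind of case a per-node relevance measure such as the one the paper proposes is meant to handle.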
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.