The Application and Implementation of Tree Edit Distance in Web Information Extraction*



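Cite this article:	Nie Hui, Huang Guipeng. The Application and Implementation of Tree Edit Distance in Web Information Extraction*[J]. 现代图书情报技术 (New Technology of Library and Information Service), 2010, 26(5): 29-34.

Authors:	Nie Hui, Huang Guipeng

Affiliation:	(Department of Information Management, Sun Yat-sen University, Guangzhou 510275)

Funding:	*This paper is one of the research outcomes of the 2008 Ministry of Education Humanities and Social Sciences Research Project "Research on Knowledge Acquisition for Digital Libraries Based on Information Extraction" (Project No. 08JC870013) and the 2009 Sun Yat-sen University Young Teacher Training Project "Research on Implementation Techniques for an Intelligent Deep Search Engine" (Project No. 2000-3161101).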

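Abstract:	This paper introduces the concept of edit distance, discusses how to construct a tag tree, and uses a tag-tree matching algorithm to quantify the structural similarity of Web pages. The algorithm is applied to Web information extraction: sample pages are first coarsely clustered with a URL similarity algorithm and then finely clustered with the tree similarity matching algorithm, which yields the template pages. On the basis of the template pages, the structural similarity algorithm is applied again and combined with extraction rules derived from the template pages to extract Web pages automatically. Experiments show that introducing the algorithm effectively improves the wrapper's extraction precision and its degree of semi-automation.
Keywords:	Web information extraction; tree edit distance; structural similarity; Web clustering
Received:	2010-03-10
Revised:	2010-04-26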
The Application and Implementation of Tree Edit Distance in Web Information Extraction

Nie Hui, Huang Guipeng. The Application and Implementation of Tree Edit Distance in Web Information Extraction[J]. New Technology of Library and Information Service, 2010, 26(5): 29-34.

Authors:	Nie Hui, Huang Guipeng

Institution:	(School of Information Management, Sun Yat-sen University, Guangzhou 510275, China)

Abstract:	This paper introduces the concept of edit distance and discusses how to construct a tag tree and how to calculate the similarity of two Web pages with a tree-matching algorithm. Pages are first roughly clustered by URL similarity and then further classified by the tree-matching algorithm. Based on the template page obtained by clustering, Web information is extracted automatically by combining the Web structure similarity algorithm with extraction rules. Experiments verify the feasibility and efficiency of the algorithm in the system.

Keywords:	Web information extraction; tree edit distance; structural similarity; Web clustering
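
The steps named in the abstract (building a tag tree, scoring structural similarity by tree matching, and a URL-based coarse similarity) might be sketched roughly as follows. The abstract does not state which edit-distance variant, normalization, or URL measure the authors use, so everything here is an assumption made for illustration: the distance is a simple top-down (Selkow-style) tree edit distance with unit costs, the similarity is normalized by the combined tree size, and the URL score compares path segments position by position. All names (`Node`, `build_tag_tree`, `tree_distance`, `structural_similarity`, `url_similarity`) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; text content is ignored."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def size(self):
        return 1 + sum(child.size() for child in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic '#document' root."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_distance(a, b):
    """Top-down tree edit distance (assumed variant, unit costs).

    Relabeling a node costs 1; inserting or deleting a child removes or
    adds its whole subtree and costs the subtree's size.
    """
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + a.children[i - 1].size()
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + b.children[j - 1].size()
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + a.children[i - 1].size(),   # delete subtree
                dp[i][j - 1] + b.children[j - 1].size(),   # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


def structural_similarity(a, b):
    """Maps the distance into [0, 1]; 1.0 means identical tag trees."""
    return 1.0 - tree_distance(a, b) / (a.size() + b.size())


def url_similarity(u1, u2):
    """Assumed coarse measure: share of path segments equal at the same position."""
    p1, p2 = urlparse(u1), urlparse(u2)
    if p1.netloc != p2.netloc:
        return 0.0
    s1 = [s for s in p1.path.split("/") if s]
    s2 = [s for s in p2.path.split("/") if s]
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return sum(1 for x, y in zip(s1, s2) if x == y) / longest
```

In a two-stage clustering of the kind the abstract describes, pages whose `url_similarity` exceeds some threshold could be grouped first, and `structural_similarity` would then be computed only within each group. The thresholds themselves are not given in the abstract and would have to be chosen empirically.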
