
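Automatic Web Information Extraction Based on GATE Semantic Annotation
Cite this article: Nie Hui, Huang Guipeng. Automatic Web Information Extraction Based on GATE Semantic Annotation [J]. Library and Information Service, 2010, 54(5): 110-114.
Authors: Nie Hui  Huang Guipeng
Affiliation: Department of Information Management, Sun Yat-sen University
Funding: Humanities and Social Sciences Research Project of the Ministry of Education
Abstract: This paper focuses on how to implement automatic Web information extraction from semantically annotated samples. Using the natural language processing framework GATE, a domain ontology is first introduced to semantically annotate the content of sample Web pages, precisely locating the semantic items to be extracted; on that basis, the sample pages are parsed into S-DOM trees. Feature descriptions of the semantic items are then extracted from the S-DOM trees to form sample instances, and a machine learning algorithm induces extraction rules from them, automatically generating a wrapper. During extraction, the system detects changes in Web pages by comparing the similarity of page structures, and it actively learns and extends the rule base. Experimental results show that, because precise locating ensures the quality of the learning samples, a wrapper learned from a small sample can achieve fairly satisfactory recall and precision.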

Keywords: Web information extraction  semantic annotation  wrapper
Received: 2009-09-08
Revised: 2009-11-22

Automatic Web Information Extraction Based on GATE Semantic Annotation
Nie Hui, Huang Guipeng. Automatic Web Information Extraction Based on GATE Semantic Annotation [J]. Library and Information Service, 2010, 54(5): 110-114.
Authors: Nie Hui  Huang Guipeng
Institution: Department of Information Management, Sun Yat-sen University
Abstract: This paper studies automatic Web information extraction. Using GATE, an infrastructure for developing and deploying software components that process natural language, domain knowledge drawn from a domain ontology is applied to semantic annotation. First, the target extraction data are labeled precisely, and the training pages are then parsed into S-DOM trees. Features of the target data are extracted from the S-DOM trees and fed as training data to a rule learner module, which induces extraction rules automatically by machine learning. A self-adaptive function is designed for the extraction process: differences between Web pages are detected by checking page similarity, and, depending on the result, the rule learner actively learns new rules and extends and updates the rule set automatically. Our experiment shows that the high-quality learning samples obtained through precise semantic labeling make it possible to reach the desired recall and precision even with a small number of sample pages.
Keywords:Web information extraction  semantic annotation  wrapper  
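
The self-adaptive step described in the abstract, detecting page changes by comparing page structure similarity, can be illustrated with a small sketch. The abstract does not say which similarity measure the authors use, so the following is only an assumption: it takes the Jaccard similarity of the sets of root-to-node tag paths of two pages. The names `tag_paths`, `structure_similarity`, and `needs_relearning`, and the threshold of 0.8, are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in the given HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets, in [0, 1]."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def needs_relearning(sample_html, new_html, threshold=0.8):
    """Assumed policy: relearn rules when similarity drops below threshold."""
    return structure_similarity(sample_html, new_html) < threshold


if __name__ == "__main__":
    sample = "<html><body><div><p>x</p></div></body></html>"
    same_layout = "<html><body><div><p>y</p><p>z</p></div></body></html>"
    new_layout = "<html><body><table><tr><td>x</td></tr></table></body></html>"
    print(needs_relearning(sample, same_layout))
    print(needs_relearning(sample, new_layout))
```

In this sketch, a page that shares the sample's layout but carries different text scores 1.0 and would be handled by the existing wrapper, while a page with a different layout scores low and would trigger rule learning. A real system would more likely compare the annotated S-DOM trees themselves rather than bare tag paths.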
This article is indexed in Wanfang Data and other databases.