Question Classification and Corpus Construction of Chinese Health

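Cite this article:	GUO HaiHong, LI Jiao, DAI Tao. Question Classification and Corpus Construction of Chinese Health[J]. Technology Intelligence Engineering (情报工程), 2016, 2(6): 039-049.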

Authors:	GUO HaiHong (郭海红), LI Jiao (李姣), DAI Tao (代涛)

Affiliation:	Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020

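Funding:	This work was supported by the Chinese Academy of Medical Sciences Fundamental Research Funds for Central Public-Interest Research Institutes, project "Research on Chinese Consumer Health Question Classification and Health Information Needs Mining" (2016ZX330011), and by the National Social Science Fund of China, project "Research on Constructing a Health Knowledge Organization System for Knowledge Services" (14BTQ032).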

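Abstract:	This paper aims to build a classification method for Chinese health questions and, by manually classifying and annotating hypertension-related health questions, to analyze the public's hypertension-related health information needs and to provide a corpus foundation for developing hypertension-related intelligent Chinese question answering systems. Based on clinical question classifications and a layered model of context for consumer health information searching, the study built a four-level topic classification method for Chinese health questions. Five annotators independently classified and annotated 2,000 sample questions randomly drawn from nearly 100,000 hypertension-related questions collected from a Chinese health website, in order to refine the classification method and test its reliability, build an annotated corpus, and analyze the public's hypertension-related health information needs. The inter-rater reliability (kappa) of the five annotators' independent annotations was 0.63 for the fourth-level categories, indicating that the classification results are reliable, and the first-level categories reached high agreement (kappa=0.82), slightly better than comparable international studies. Questions in the six first-level categories of treatment, diagnosis, healthy lifestyle, clinical findings/condition management, epidemiology, and choosing a health provider accounted for 48.1%, 23.8%, 11.9%, 5.2%, 9.0%, and 1.9% of the sample, respectively. The classification method can be used to organize large collections of health questions and improve retrieval efficiency; the annotated sample questions can serve as a corpus for research on automatic classification of hypertension-related health questions; and the resulting topic distribution can help guide the construction of knowledge resources for health websites. In addition, the way the classification method was constructed, the corpus annotation workflow, and the method for measuring inter-rater reliability can serve as a reference for user question classification and corpus construction in the open domain and in other restricted domains.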
Keywords:	health questions; question classification; corpus construction; consumer health information needs
Question Classification and Corpus Construction of Chinese Health

Authors:	GUO HaiHong, LI Jiao and DAI Tao

Institution:	Institute of Medical Information, Chinese Academy of Medical Sciences

Abstract:	This study aimed to build a Chinese health question classification schema and to manually annotate hypertension-related health questions, in order to characterize the hypertension-related information needs of health consumers and to lay a corpus foundation for a hypertension-related intelligent Chinese question answering (QA) system. We built a four-level classification schema for health questions based on taxonomies of generic clinical questions and a layered model of context for consumer health information searching. Five annotators independently and manually classified 2,000 questions, randomly selected from nearly 100,000 hypertension-related messages posted on a Chinese health website, to refine the schema and test its reliability, to build an annotated corpus for Chinese health QA systems, and to analyze the hypertension-related information needs of health consumers. The kappa statistic for the five annotators was 0.63 at the fourth level, indicating "substantial" reliability, and 0.82 at the first level, indicating "almost perfect" reliability, which was slightly better than similar studies abroad. Questions in the categories of treatment, diagnosis, healthy lifestyle, management, epidemiology, and health provider choosing accounted for 48.1%, 23.8%, 11.9%, 5.2%, 9.0%, and 1.9% of the sample, respectively. The schema can help organize large collections of health questions to improve retrieval efficiency; the annotated questions can be used to train automatic topic classifiers for hypertension-related questions posted by health consumers; and the topic distribution can guide the building of knowledge bases for health websites.
In addition, the methods for building the question classification schema, the corpus annotation procedure, and the methods for evaluating inter-rater reliability designed in this research can serve as a reference for studies on user question classification and corpus building in the open domain and in other restricted domains.

Keywords:	health questions; question classification; corpus building; consumer health information needs
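
The abstract reports a kappa statistic for five annotators but does not say which variant was used. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when more than two raters each label every item, and maps the result onto the Landis and Koch scale, which is where labels such as "substantial" and "almost perfect" come from. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Assumption: every item is labeled by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / pairs for cnt in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    total_ratings = n_items * n_raters
    p_e = sum((c / total_ratings) ** 2 for c in totals.values())

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch_label(kappa):
    """Verbal interpretation of kappa on the Landis and Koch (1977) scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Toy data: 3 questions, 5 hypothetical annotators, first-level labels.
    toy = [
        ["treatment"] * 5,
        ["diagnosis"] * 4 + ["treatment"],
        ["healthy lifestyle"] * 5,
    ]
    k = fleiss_kappa(toy)
    print(round(k, 3), landis_koch_label(k))
```

On this scale, the reported values of 0.63 and 0.82 fall in the "substantial" and "almost perfect" bands, which matches the wording of the abstract.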
This article is indexed in databases including Wanfang Data.