
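基于句子的文本表示及中文文本分类研究 (Research on Sentence-Based Text Representation and Chinese Text Classification)
Cite as: 何维, 王宇. 基于句子的文本表示及中文文本分类研究[J]. 情报学报, 2009, 28(6).
Authors: 何维 (He Wei), 王宇 (Wang Yu)
Affiliation: School of Management, Dalian University of Technology, Dalian 116024
Funding: Supported by a key project of the National Natural Science Foundation of China
Abstract (translated from the Chinese): Text mining is a key technology for information resources management. The vector space model is a mature text representation model in text mining; it usually takes words or phrases as feature items, but these items carry little semantic information. To enable content-based text mining, this paper raises the segmentation granularity from words or phrases to sentences, represents each text as a bag of sentences, defines text similarity in terms of sentence similarity, and applies the KNN algorithm to Chinese text classification to verify the feasibility of the model. Experiments show that the bag-of-sentences KNN algorithm achieves satisfactory average precision (92.12%) and recall (92.01%).

Keywords (translated): information resources management; bag of sentences; text representation; text classification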

Text Representation Based on Sentence and Chinese Text Categorization
He Wei, Wang Yu. Text Representation Based on Sentence and Chinese Text Categorization[J]. Journal of the China Society for Scientific and Technical Information, 2009, 28(6).
Authors:He Wei  Wang Yu
Abstract: Text mining is a key technology in information resources management. The vector space model is a mature text representation model in text mining. Words and phrases are commonly used as feature items, but they provide little semantic information. To support content-based text mining, the segmentation granularity is raised from words and phrases to sentences. A text is represented as a bag of sentences, and text similarity is defined in terms of sentence similarity. To validate this representation, a Chinese text classifier was built with the KNN algorithm; it achieved good average precision (92.12%) and recall (92.01%) in the experiments.
Keywords:information resources management  bag of sentences  text representation  text categorization
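
The pipeline summarized in the abstract (bag of sentences, text similarity built from sentence similarity, KNN classification) can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas, so everything concrete here is an assumption made for illustration: the punctuation-based sentence splitter, the character-bigram Jaccard measure used as sentence similarity, the best-match averaging used as text similarity, and the toy `training` data. It is not the authors' implementation and will not reproduce the reported figures.

```python
import re
from collections import Counter


def split_sentences(text):
    """Cut a text into a bag of sentences.

    Assumption: sentences end at common Chinese/Western punctuation.
    Note that '.' also splits decimals; a real splitter would be more careful.
    """
    parts = re.split(r"[。！？；!?;.]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_similarity(a, b):
    """Jaccard overlap of character bigrams.

    A stand-in measure chosen for the sketch, not the paper's definition.
    """
    def bigrams(s):
        # Fall back to the whole string for one-character sentences.
        return {s[i:i + 2] for i in range(len(s) - 1)} or {s}

    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)


def text_similarity(bag_a, bag_b):
    """Symmetric average of each sentence's best match in the other bag.

    Assumption: this aggregation is one plausible way to lift sentence
    similarity to text similarity; the source does not specify one.
    """
    if not bag_a or not bag_b:
        return 0.0

    def directed(src, dst):
        return sum(max(sentence_similarity(s, t) for t in dst) for s in src) / len(src)

    return (directed(bag_a, bag_b) + directed(bag_b, bag_a)) / 2


def knn_classify(text, training, k=3):
    """Label a text by majority vote among its k most similar training texts."""
    query = split_sentences(text)
    scored = [(text_similarity(query, split_sentences(doc)), label)
              for doc, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, invented for this example only.
training = [
    ("球队赢得了比赛。球迷非常高兴。", "sports"),
    ("球队输掉了比赛。教练很失望。", "sports"),
    ("股票市场今天上涨。投资者很高兴。", "finance"),
    ("股票市场今天下跌。投资者很担心。", "finance"),
]

print(knn_classify("球队赢得了冠军。球迷庆祝比赛。", training, k=3))
```

Under these assumptions a sentence is compared as a whole unit rather than as independent word features, which is the point of moving from words or phrases to sentences; any stronger sentence similarity measure could be substituted in `sentence_similarity` without changing the rest of the sketch.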
This article is indexed in databases including 万方数据 (Wanfang Data).