
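Research on a VSM-Based Method for Computing the Similarity Between Scientific and Technical Journal Literature and Patent Literature
Cite this article: Zeng Wen, Xu Hongjiao, Li Ying, Wang Lijun, Zhao Jing. Research on a VSM-Based Method for Computing the Similarity Between Scientific and Technical Journal Literature and Patent Literature [J]. 情报工程, 2016, 2(3): 037-042.
Authors: Zeng Wen, Xu Hongjiao, Li Ying, Wang Lijun, Zhao Jing
Affiliation: Institute of Scientific and Technical Information of China, Beijing 100038
Funding: This research was supported by the National Social Science Fund of China (grant no. 14BTQ038) and by the pre-research fund for research projects of the Institute of Scientific and Technical Information of China (grant no. YY2016-08).
Abstract: Text similarity is mostly computed by using TF-IDF to model texts as a term-frequency vector space model (VSM). Taking into account the characteristics of scientific and technical journal literature and patent literature, this paper improves the TF-IDF calculation by replacing word-frequency statistics with frequency statistics of scientific and technical terms, and proposes a similarity calculation method for scientific and technical literature. The method first applies natural language processing techniques to preprocess the documents, then extracts their terms with an automatic term extraction method, and builds a vector space model using the term-weight formula proposed in the paper to compute the similarity between journal literature and patent literature. Experiments on real scientific journal and literature data show that the proposed method outperforms the traditional TF-IDF calculation method.

Keywords: natural language processing; TF-IDF; vector space model; scientific and technical journals; patents; similarity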

The Study of Correlation Calculation Method Based on the VSM for Scientific and Technological Periodicals and Patents
Authors: ZENG Wen, XU Hongjiao, LI Ying, WANG Lijun and ZHAO Jing
Institution: Institute of Scientific and Technical Information of China (all authors)
Abstract: Conventional text similarity measures use TF-IDF to model documents as a term-frequency vector space model (VSM) and then compute the similarity between documents. This paper proposes a new similarity calculation method for scientific and technological (S&T) documents. Based on the characteristics of these documents, we replace word-frequency statistics with S&T term-frequency statistics to improve the TF-IDF algorithm. The new method applies natural language processing to preprocess the documents and uses automatic term extraction to extract S&T terms. A term-weighted VSM is then constructed with the new weight formula to calculate the similarity between S&T periodical literature and patents. We test the new method on real S&T documents and compare its results with those of the original method. The results show that the proposed method is superior to the original TF-IDF method.
Keywords: natural language processing; TF-IDF; vector space model; journal of science and technology; patent; similarity
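
The pipeline described in the abstract (extract terms, weight them, compare documents in a vector space) can be illustrated with a minimal sketch. It rests on several assumptions: the terms are taken as already extracted, since the abstract does not specify the extraction method; the weight is a standard smoothed TF-IDF, not the paper's improved term-weight formula, which the abstract does not give; and the function names `tfidf_vectors` and `cosine_similarity` and the sample terms are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs_terms):
    """Build sparse TF-IDF vectors from lists of pre-extracted terms.

    docs_terms: one list of term strings per document. Term extraction
    itself is assumed to have happened upstream.
    """
    n_docs = len(docs_terms)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))

    vectors = []
    for terms in docs_terms:
        counts = Counter(terms)
        total = sum(counts.values())
        vec = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (an assumption; not the paper's formula).
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical extracted terms for a journal article and two patents.
journal = ["vector space model", "term extraction", "patent"]
patent_a = ["patent", "claim", "vector space model"]
patent_b = ["protein", "enzyme"]

vecs = tfidf_vectors([journal, patent_a, patent_b])
print(cosine_similarity(vecs[0], vecs[1]))  # shared terms: positive score
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms: 0.0
```

Treating each multi-word term as a single vector dimension is what distinguishes this from word-level TF-IDF: "vector space model" counts once as a unit instead of as three separate words.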
This article is indexed in Wanfang Data (万方数据) and other databases.