Similar literature
 19 similar documents found
1.
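李慧, 《现代情报》, 2015, 35(4): 172-177
Word similarity computation is widely used in natural language processing fields such as information retrieval, word sense disambiguation, and machine translation. Existing word similarity algorithms fall into two main categories: statistics-based and semantic-resource-based. The former computes similarity from contextual co-occurrence information gathered from large-scale corpora, while the latter relies on manually constructed semantic dictionaries or semantic networks. This paper compares the two categories, focusing on algorithms based on Web corpora and on Wikipedia, and summarizes their characteristics and shortcomings. It concludes that, under the influence of information technology, word similarity algorithms based on Wikipedia and on hybrid techniques, as well as similarity computation driven by linked data, are likely directions for future development.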

2.
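This paper proposes a maximum-similarity-first clustering method based on a suffix tree model. A generalized suffix tree model of the document set is built to extract phrases as feature terms, which are mapped into an M-dimensional vector space model. The inter-document similarity matrix is then computed, the similarities of all document pairs are sorted in descending order, and the pairs with the highest similarity are merged first to form initial clusters. Finally, the initial clusters are merged to obtain the final clustering result.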
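The merge order described above (sort all document pairs by similarity, merge the most similar first) can be sketched as follows. This is only a minimal illustration under assumptions that are not in the abstract: documents are already given as numeric feature vectors (the suffix-tree phrase extraction is not reproduced), cosine similarity is used, and a hypothetical `threshold` parameter stops the merging. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity_first(vectors, threshold=0.5):
    """Merge document pairs in descending order of similarity.

    `threshold` is an assumed stopping rule, not taken from the paper.
    Returns clusters as sorted lists of document indices.
    """
    n = len(vectors)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # All pairwise similarities, highest first.
    pairs = sorted(
        ((cosine(vectors[i], vectors[j]), i, j)
         for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2]]
print(max_similarity_first(docs))
```

With these toy vectors, documents 0 and 1 form one cluster and documents 2 and 3 another, because only those two pairs exceed the assumed threshold.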

3.
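Research on a Chinese Question Answering System Based on an Improved VSM   Total citations: 1, self-citations: 0, citations by others: 1
Because the weighting formula in the vector space model considers only the frequency of a term within a document, the paper introduces the concept of a term's own domain weight and improves the weight calculation of the vector space model. Sentence similarity is then computed by additionally combining keyword distance and keyword order information. In comparative S@n tests on FAQ retrieval for a specific course, the improved similarity model achieved higher S@n values.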

4.
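[Purpose/Significance] To provide a reference for research on cross-lingual text clustering. [Method/Process] First, each document is split into sentences and a semantic feature value is computed for every sentence to determine the document's set of feature sentences, from which the document vector representation is built. Second, the idea of Word Rotator's Distance (WRD) is introduced into the similarity step, yielding the Semantic Feature Sentence Vectors' Distance (SFSVD) similarity measure, which gives the similarity between documents. Finally, the HAC clustering algorithm produces the clustering result. [Results/Conclusions] Compared with existing methods, the proposed Chinese-Russian cross-lingual text clustering method shows significantly higher and stable Purity and NMI values. Semantic feature sentences combined with the SFSVD similarity measure represent text information fairly accurately and thereby further improve Chinese-Russian cross-lingual clustering performance.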
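For readers unfamiliar with the Purity metric reported above, the sketch below uses its usual definition: each cluster is credited with its most frequent true class, and the credited counts are summed and divided by the number of documents. The toy labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Clustering purity: fraction of items that belong to the majority
    true class of the cluster they were assigned to."""
    members_by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        members_by_cluster.setdefault(cluster, []).append(truth)
    # For each cluster, count how many members share its most common class.
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in members_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical example: six documents, two clusters, two topic labels.
assigned = [0, 0, 0, 1, 1, 1]
topics = ["sports", "sports", "finance", "finance", "finance", "sports"]
print(purity(assigned, topics))
```

A purity of 1.0 means every cluster contains a single class; the value alone does not penalize having many small clusters, which is one reason it is usually reported alongside NMI.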

5.
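盛秋艳, 《情报科学》, 2012, (8): 1238-1241
As an effective tool for describing concept systems at the semantic and knowledge levels, ontology technology offers new opportunities for computing similarity between words. Word similarity is an important topic in knowledge representation and information retrieval. This paper uses an ontology to organize concepts and computes the semantic similarity between them: semantic similarity is split into concept similarity and description similarity, and the two are combined to produce the final semantic similarity. A computer-domain ontology built from the 《中国分类主题词表》 (Chinese Classified Thesaurus) is used to verify the effectiveness of the method.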

6.
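张瑾, 《情报科学》, 2013, (8): 71-76
Semantic ontology similarity computation based on the 《中图法》 (Chinese Library Classification) combines the content and structure of the classification and uses semantic logical relations, among other means, to compute semantic similarity. The inference rules established in this way reflect the semantic relations between words well and improve the precision of word similarity computation.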

7.
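Research on a Document Retrieval System Based on a Semantic Vector Space Model   Total citations: 1, self-citations: 0, citations by others: 1
To address [...] semantic similarity in the vector space model, the paper establishes a semantic vector space model and designs a document retrieval system based on it. It focuses on two core techniques, semantic similarity computation and query expansion, and verifies the effectiveness of the retrieval system with an example.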

8.
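Focusing on the two basic problems of text clustering, text representation and similarity computation, this paper classifies and gives a fairly comprehensive survey of the text representation and similarity computation methods proposed in the literature. Text representation models are divided into vector space models, language models, suffix tree models, ontologies, and others; similarity computation methods are divided into those based on the vector space model, those based on phrases, and those based on ontologies.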

9.
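Research on Text Similarity in Information Retrieval   Total citations: 2, self-citations: 0, citations by others: 2
Using a term frequency matrix, a fuzzy similarity matrix, and the maximum tree method from fuzzy clustering, this paper computes text similarity within a set of documents returned by relevance-based retrieval using the absolute-value subtraction method. An example compares the method with the commonly used cosine measure, with good results.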
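As a rough illustration of the two measures being compared, the sketch below uses the form of the absolute-value subtraction method commonly given in fuzzy clustering texts, r = 1 - c * Σ|x_k - y_k|, next to cosine similarity. The scaling constant `c` and the term-frequency vectors are assumptions chosen for the example; the abstract does not state the values used in the paper.

```python
from math import sqrt


def abs_subtraction_similarity(x, y, c):
    """Absolute-value subtraction similarity: 1 - c * sum(|x_k - y_k|).

    `c` is an assumed scaling constant; it must be small enough that the
    result stays within [0, 1] for every pair in the collection.
    """
    return 1.0 - c * sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical term-frequency vectors for two documents.
doc_a = [2, 0, 1]
doc_b = [1, 1, 1]
print(abs_subtraction_similarity(doc_a, doc_b, c=0.1))
print(cosine_similarity(doc_a, doc_b))
```

The first measure reacts to differences in raw term counts, while cosine depends only on the direction of the vectors, so the two can rank document pairs differently.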

10.
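Traditional information retrieval methods overlook the influence of document structure on the retrieval process. This paper proposes an improved retrieval method based on document structure: the document collection is first filtered using the first class of feature fields and then matched and ranked using the second class; the AHP method is introduced to determine the importance weight of each feature field dynamically; finally, the overall similarity value is obtained by a vector inner product. Experimental results show that the method improves retrieval precision and the soundness of result ranking.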
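The last two steps (AHP weights, then an inner product) might look roughly like the sketch below. It is an illustration under stated assumptions: the weights are approximated with the row geometric-mean variant of AHP, the pairwise comparison matrix is invented, and the three fields (title, abstract, body) are hypothetical stand-ins for the paper's feature fields, which the abstract does not list.

```python
from math import prod


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important field i is than field j
    (a reciprocal matrix on the usual 1-9 scale).
    """
    n = len(pairwise)
    geometric_means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geometric_means)
    return [g / total for g in geometric_means]


def total_similarity(weights, field_similarities):
    """Inner product of field weights and per-field similarity scores."""
    return sum(w * s for w, s in zip(weights, field_similarities))


# Assumed judgments: title > abstract > body.
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(comparisons)
print(weights)
# Hypothetical per-field similarities between a query and one document.
print(total_similarity(weights, [0.9, 0.4, 0.2]))
```

A full AHP procedure would also check the consistency ratio of the comparison matrix before trusting the weights; that step is omitted here.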

11.
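廖开际, 杨彬彬, 《情报杂志》, 2012, 31(7): 182-186
Traditional text similarity algorithms based on term frequency statistics usually consider only the weights of feature terms in a text and ignore the semantic relations among them. Taking both the importance of feature terms in the text and the semantic relations among them into account, the paper proposes a weighted semantic network model of text feature terms for computing similarity between texts, and makes appropriate improvements to feature selection and weight calculation while building the model. Experiments verify that the weighted-semantic-network algorithm computes similarity more accurately than traditional algorithms.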

12.
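[Purpose] To study an automatic summarization method based on bags of word vectors, using a vector space to describe semantic information. [Method] A summary is a shortened, accurate expression of a document's content. A bag of word vectors can represent words, phrases, sentences, paragraphs, and whole documents in the same vector space, where spatial distance reflects semantic similarity. The proposed method uses the distance between these representations to measure the semantic similarity between each sentence and the whole document, and extracts the sentences most semantically similar to the document to form the summary. [Results] Experiments on the DUC01 dataset show that the method generates high-quality summaries and clearly outperforms the other methods. [Conclusion] The experiments demonstrate that the method clearly improves automatic summarization performance.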

13.
In this paper two distinct similarity measures in a document vector space, the distance-based and angle-based similarity measures, are compared, and a newly developed similarity measure based upon both the distance and angle strengths of two compared objects is presented. The concept of the iso-extent contour, which facilitates the understanding of the nature of the newly developed similarity measure, is introduced. The three different similarity measures are compared and the properties of the newly developed similarity measure are addressed.
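The difference between the two classical measures can be seen in a small sketch. The formulas here are common textbook choices and are assumptions, not the paper's definitions: distance-based similarity is taken as 1 / (1 + Euclidean distance), and angle-based similarity as 1 minus the angle divided by π. The combined measure proposed in the paper is not reproduced because the abstract does not define it.

```python
from math import acos, dist, pi, sqrt


def distance_similarity(x, y):
    """Assumed distance-based form: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + dist(x, y))


def angle_similarity(x, y):
    """Assumed angle-based form: 1 - angle / pi, where the angle lies in [0, pi]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    if norm == 0.0:
        return 0.0  # the angle is undefined for a zero vector
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return 1.0 - acos(cos) / pi


query = [1.0, 1.0]
longer = [3.0, 3.0]  # same direction as the query, three times the length
print(distance_similarity(query, longer))
print(angle_similarity(query, longer))
```

The two vectors point in the same direction, so the angle-based score is essentially 1, while the distance-based score is much lower because the vectors differ in length. Cases like this are where the two measures disagree.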

14.
Estimating the similarity between two legal case documents is an important and challenging problem, having various downstream applications such as prior-case retrieval and citation recommendation. There are two broad approaches for the task: citation network-based and text-based. Prior citation network-based approaches consider citations only to prior-cases (also called precedents) (PCNet). This approach misses important signals inherent in Statutes (written laws of a jurisdiction). In this work, we propose Hier-SPCNet that augments PCNet with a heterogeneous network of Statutes. We incorporate domain knowledge for legal document similarity into Hier-SPCNet, thereby obtaining state-of-the-art results for network-based legal document similarity. Both textual and network similarity provide important signals for legal case similarity; but till now, only trivial attempts have been made to unify the two signals. In this work, we apply several methods for combining textual and network information for estimating legal case similarity. We perform extensive experiments over legal case documents from the Indian judiciary, where the gold standard similarity between document-pairs is judged by law experts from two reputed Law institutes in India. Our experiments establish that our proposed network-based methods significantly improve the correlation with domain experts’ opinion when compared to the existing methods for network-based legal document similarity. Our best-performing combination method (that combines network-based and text-based similarity) improves the correlation with domain experts’ opinion by 11.8% over the best text-based method and 20.6% over the best network-based method. We also establish that our best-performing method can be used to recommend/retrieve citable and similar cases for a source (query) case, which are well appreciated by legal experts.

15.
Measuring the similarity between the semantic relations that exist between words is an important step in numerous tasks in natural language processing such as answering word analogy questions, classifying compound nouns, and word sense disambiguation. Given two word pairs (A, B) and (C, D), we propose a method to measure the relational similarity between the semantic relations that exist between the two words in each word pair. Typically, a high degree of relational similarity can be observed between proportional analogies (i.e. analogies that exist among the four words, A is to B as C is to D). We describe eight different types of relational symmetries that are frequently observed in proportional analogies and use those symmetries to robustly and accurately estimate the relational similarity between two given word pairs. We use automatically extracted lexical-syntactic patterns to represent the semantic relations that exist between two words and then match those patterns in Web search engine snippets to find candidate words that form proportional analogies with the original word pair. The eight types of relational symmetries are used as features in a supervised learning approach. We evaluate the proposed method using the Scholastic Aptitude Test (SAT) word analogy benchmark dataset. Our experimental results show that the proposed method can accurately measure relational similarity between word pairs by exploiting the symmetries that exist in proportional analogies. The proposed method achieves an SAT score of 49.2% on the benchmark dataset, which is comparable to the best results reported on this dataset.

16.
Recently, using pretrained word embeddings to represent words has achieved success in many natural language processing tasks. Depending on their objective functions, different word embedding models capture different aspects of linguistic properties. However, the Semantic Textual Similarity task, which evaluates similarity/relation between two sentences, requires taking these linguistic aspects into account. Therefore, this research aims to encode various characteristics from multiple sets of word embeddings into one embedding and then learn similarity/relation between sentences via this novel embedding. Representing each word by multiple word embeddings, the proposed MaxLSTM-CNN encoder generates a novel sentence embedding. We then learn the similarity/relation between our sentence embeddings via Multi-level comparison. Our method M-MaxLSTM-CNN consistently shows strong performance in several tasks (i.e., measure textual similarity, identify paraphrase, recognize textual entailment). Our model does not use hand-crafted features (e.g., alignment features, Ngram overlaps, dependency features) and does not require pre-trained word embeddings to have the same dimension.

17.
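刘爱琴, 安婷, 《现代情报》, 2019, 39(8): 52-58
[Purpose/Significance] Knowledge association across non-related literature can promote the generation of new knowledge and provides an effective aid to scientific research. [Method/Process] Using the 《中国分类主题词表》 (Chinese Classified Thesaurus) as the controlled vocabulary of subject terms, the paper first performs Chinese word segmentation on document abstracts and extracts subject terms, then uses quantitative analysis and clustering techniques to analyze the degree of similarity and dissimilarity between document features. The system then performs retrieval for users and uses the TOP-K algorithm to return precise results. [Results/Conclusions] A knowledge association retrieval system for non-related literature is designed that reveals knowledge associations between documents at a finer granularity and provides users with high-quality service.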
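The abstract does not say which TOP-K algorithm the system uses. As a generic illustration only, the sketch below selects the k highest-scoring documents with a heap via the standard-library `heapq.nlargest`; the document identifiers and scores are hypothetical.

```python
import heapq


def top_k(scores, k):
    """Return the k (document, score) pairs with the highest scores,
    ordered from highest to lowest."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical similarity scores between a query and candidate documents.
candidate_scores = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.5, "doc4": 0.7}
print(top_k(candidate_scores, 2))
```

For a small k over a large candidate set, a heap-based selection avoids sorting every candidate, which is the usual motivation for top-k retrieval.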

18.
19.
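Text mining is the core of knowledge discovery based on non-related literature. This paper divides the text mining process into three sub-processes: from the literature source to the initial literature set, from the initial literature set to the intermediate term set, and from the intermediate term set to the associated term set. The methods used in each sub-process are analyzed and compared. On this basis, existing problems in text mining are analyzed and improvements are proposed.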
