Similar Literature
 20 similar documents found; search time: 31 ms
1.
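Research on a Fuzzy Document Similarity Algorithm Based on Domain Ontology   Total citations: 1, self-citations: 0, citations by others: 1
A domain ontology is built from an integrated classification-subject thesaurus. Relations between concepts are defined, a semantic similarity formula with an adjustment factor is introduced to obtain a concept similarity algorithm, and document-to-document similarity is then computed with the cosine coefficient. The algorithm's results were compared with similarity values estimated by domain experts, and the comparison confirmed that the algorithm is effective.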

2.
This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction.
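
As a rough illustration of the comparison step mentioned in the abstract above (fixed-length layout vectors compared by Manhattan distance, then rank-ordered against a test page), the following Python sketch may help. It does not reproduce the paper's interval encoding; the vectors, the page names, and the helper names `manhattan_distance` and `rank_by_layout` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("layout vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_layout(
    query: Sequence[float], pages: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (page name, distance) pairs, most similar layout first."""
    scored = [(name, manhattan_distance(query, vec)) for name, vec in pages.items()]
    # A smaller distance means a more similar layout, so sort ascending.
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical layout vectors; real ones would come from a feature extractor.
    pages = {"letter": [1, 0, 2], "memo": [1, 1, 2], "form": [5, 5, 5]}
    for name, dist in rank_by_layout([1, 0, 2], pages):
        print(name, dist)
```

Because the distance is a plain sum of absolute differences, ranking a collection costs time linear in the number of pages and the vector length, which is consistent with the abstract's emphasis on fast comparison.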

3.
Exploiting the Similarity of Non-Matching Terms at Retrieval Time   Total citations: 2, self-citations: 0, citations by others: 2
In classic Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as term mismatch, has been recognised for a long time by the Information Retrieval community and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity makes it possible to enhance classic retrieval models by taking non-matching terms into account. The theoretical advantages and drawbacks of these models are presented and compared with other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.
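
To make the term mismatch idea in the abstract above concrete, here is a minimal Python sketch of one possible way to let non-matching terms contribute to a score. It is not the model proposed in the paper: the scoring rule (each query term is credited with its best similarity-weighted match in the document), the similarity table, and the names `term_similarity` and `score` are all assumptions made for illustration.

```python
from typing import Dict, Tuple

SimTable = Dict[Tuple[str, str], float]


def term_similarity(t1: str, t2: str, sim_table: SimTable) -> float:
    """Look up a symmetric term similarity; identical terms score 1.0."""
    if t1 == t2:
        return 1.0
    return sim_table.get((t1, t2), sim_table.get((t2, t1), 0.0))


def score(
    query_weights: Dict[str, float],
    doc_weights: Dict[str, float],
    sim_table: SimTable,
) -> float:
    """Assumed rule: each query term takes its best similarity-weighted document term."""
    total = 0.0
    for q_term, q_weight in query_weights.items():
        best = max(
            (
                term_similarity(q_term, d_term, sim_table) * d_weight
                for d_term, d_weight in doc_weights.items()
            ),
            default=0.0,
        )
        total += q_weight * best
    return total


if __name__ == "__main__":
    # Hypothetical similarity knowledge: "car" and "automobile" are close.
    sims = {("car", "automobile"): 0.8}
    print(score({"car": 1.0}, {"automobile": 2.0}, sims))
```

With an empty similarity table this rule falls back to exact matching, so a query for "car" scores zero against a document that only contains "automobile", which is the mismatch case the abstract describes.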

4.
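A Suffix-Tree-Based Method for Clustering Web Search Results   Total citations: 3, self-citations: 2, citations by others: 1
To satisfy the requirements of relevance, speed, and browsable cluster descriptions in Web search result clustering at the same time, this paper proposes a suffix tree clustering algorithm suited to Chinese Web search results. The suffix tree is built with the Chinese character as the basic unit; an effective strategy resolves the cluster description problem that arises after phrase clusters are merged by the binary method; and synonymous phrase clusters are merged using their semantic-level similarity, which improves the quality of the clustering results. Tests show that, compared with traditional document clustering algorithms, the suffix-tree-based algorithm is clearly superior in both precision and efficiency for Web document clustering.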

5.
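Research on a Document Copy Detection Algorithm Based on Sentence Similarity   Total citations: 3, self-citations: 0, citations by others: 3
A document copy detection technique based on sentence similarity is proposed. It captures a document's global features while also taking its structural information into account, overcoming the inability of earlier detection algorithms to handle both, and thereby improves detection precision. The algorithm's detection results are then compared with those of other algorithms. Experiments show that the algorithm is feasible.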

6.
The paper proposes a Vector Space Model over the Cayley-Klein Hyperbolic Geometry (referred to as Hyperbolic Information Retrieval = HIR) using a similarity measure derived from the hyperbolic distance. It is shown that the proposed model is equivalent to the classical Vector Space Model using the cosine measure with a normalized weighting scheme. It is also shown that the categoricity of the new retrieval system can be varied by only modifying the radius of the hyperbolic space and without using a different weighting scheme and similarity measure, which is not the case in the VSM, where the same effect can only be obtained by changing both the weighting scheme and the similarity measure at the expense of a more costly computation. Experiments are also reported to demonstrate and support the ideas, and they show that categoricity in HIR can be varied more than O(n) faster, where n is the number of index terms, than in the VSM.

7.
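[Purpose/Significance] Research on topic-based semantic similarity computation in the medical domain is currently sparse and does not yet reveal semantic-level relationships between topics. This paper proposes a set of methods for computing semantic similarity between topics, so that relationships between topics can be judged from a semantic perspective, providing a reference for topic novelty assessment, topic association research, and similar work. [Method/Process] Taking the MeSH thesaurus as the basis for semantic computation, the paper analyzes the thesaurus structure and existing research, measures semantic similarity between topics along three dimensions (entry terms, semantic distance, and annotations), and carries out an empirical study on stem cell literature in PubMed from 2011 to 2014. [Result/Conclusion] The validity of the three proposed dimensions is verified using commonly used validation pairs of subject headings. Computing semantic similarity between topics shows that the more novel topic in the stem cell field during 2011-2014 was stem cell research involving minors. Future work needs to incorporate statistics-based topic similarity in order to reveal relationships between topics more comprehensively and to discover novel research topics in a field at the semantic level.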

8.
A similarity comparison is made between 120 journals from five allied Web of Science disciplines (Communication, Computer Science-Information Systems, Education & Educational Research, Information Science & Library Science, Management) and a more distant discipline (Geology) across three time periods using a novel method called citing discipline analysis that relies on the frequency distribution of Web of Science Research Areas for citing articles. Similarities among journals are evaluated using multidimensional scaling with hierarchical cluster analysis and Principal Component Analysis. The resulting visualizations and groupings reveal clusters that align with the discipline assignments for the journals for four of the six disciplines, but also greater overlaps among some journals for two of the disciplines or categorizations that do not necessarily align with their assigned disciplines. Some journals categorized into a single given discipline were found to be more closely aligned with other disciplines and some journals assigned to multiple disciplines more closely aligned with only one of the assigned disciplines. The proposed method offers a complementary way to more traditional methods such as journal co-citation analysis to compare journal similarity using data that are readily available through Web of Science.

9.
Simple Semantics in Topic Detection and Tracking   Total citations: 3, self-citations: 0, citations by others: 3
Topic Detection and Tracking (TDT) is a research initiative that aims at techniques to organize news documents in terms of news events. We propose a method that incorporates simple semantics into TDT by splitting the term space into groups of terms whose meanings are of the same type. Such a group can be associated with an external ontology. This ontology is used to determine the similarity of two terms in the given group. We extract proper names, locations, temporal expressions and normal terms into distinct sub-vectors of the document representation. The similarity of two documents is measured by comparing one pair of corresponding sub-vectors at a time. We use a simple perceptron to optimize the relative emphasis of each semantic class in the tracking and detection decisions. The results suggest that the spatial and the temporal similarity measures need to be improved. In particular, the vagueness of spatial and temporal terms needs to be addressed.

10.
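[Purpose/Significance] Citing and cited documents often share some kind of similarity; revealing the mechanism behind this phenomenon helps to understand the nature of citation more deeply. [Method/Process] Using exponential random graph models, an empirical analysis is carried out on the field of library and information science to reveal how document similarity affects citation relationships. [Result/Conclusion] The empirical study finds a significant tendency toward similarity between citing and cited documents at the levels of network structure, institution, and journal. Specifically, citation relationships tend to be embedded in transitive triangle structures; documents from the same institution or the same journal are more likely to cite one another; and documents from countries with a dominant position in the discipline are more likely to cite one another. The results indicate that social proximity is an important mechanism in the formation of citation behavior and reflect the social nature of citation preference.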

11.
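Research on a Multimedia Information Retrieval Method Based on Fuzzy Semantic Distance   Total citations: 4, self-citations: 1, citations by others: 3
张李义 (Zhang Liyi), 《情报学报》, 2003, 22(2): 131-135
Unlike traditional exact-match database queries, the query conditions in multimedia information retrieval are incomplete. This paper describes the principle and algorithm of using fuzzy semantic distance to retrieve information from a multimedia database, uses a fuzzy similarity test as the criterion for judging retrieval results, and finally illustrates the use of the method with an example.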

12.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have been tried to solve this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study, we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.

13.
Exploiting Hierarchy in Text Categorization   Total citations: 4, self-citations: 3, citations by others: 1
With the recent dramatic increase in electronic access to documents, text categorization, the task of assigning topics to a given document, has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
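
The two-level structure described in the abstract above can be sketched in a few lines of Python. The abstract does not state how the two levels are combined, so the rule used here (multiplying the meta-topic probability by the within-group topic probability) is an assumption, as are the function name `combine_levels` and the example probabilities.

```python
from typing import Dict


def combine_levels(
    meta_probs: Dict[str, float],
    topic_probs_within_meta: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Assumed combination: P(topic) = P(meta-topic) * P(topic | meta-topic)."""
    combined: Dict[str, float] = {}
    for meta, p_meta in meta_probs.items():
        # Second-level models only discriminate among topics inside their group.
        for topic, p_topic in topic_probs_within_meta.get(meta, {}).items():
            combined[topic] = p_meta * p_topic
    return combined


if __name__ == "__main__":
    # Hypothetical outputs of a first-level and a second-level model for one document.
    meta = {"metals": 0.5, "grains": 0.5}
    within = {
        "metals": {"gold": 0.5, "silver": 0.25},
        "grains": {"wheat": 1.0},
    }
    print(combine_levels(meta, within))
```

Each topic is scored independently, so several topics can exceed a decision threshold for the same document, which matches the abstract's remark that both single and multiple topic assignments are accommodated.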

14.
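Traditional Web text classification methods use the similarity of keywords in the text as the basis for classification, which loses much important semantic information, makes classification results insufficiently accurate, and incurs a heavy computational cost. To address this, the article proposes a Web text classification method based on semantic similarity: a domain ontology is used to map the keyword-based text feature vector to a matching set of semantic concept feature vectors, a formula for Web text similarity is defined, and a KNN algorithm based on semantic similarity is designed and implemented. Experimental results show that the method represents and processes Web text at the level of semantic concepts, reduces the dimensionality of the text feature space, lowers the computational cost, and improves classification precision.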

15.
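A Web Information Retrieval Method Based on Web Page Segmentation   Total citations: 2, self-citations: 0, citations by others: 2
A Web information retrieval algorithm based on page content segmentation is proposed. Exploiting the semi-structured nature of Web pages, the algorithm segments each page into regions according to its HTML tags and content. After an HTML tag tree is built, nodes are merged using content similarity and visual similarity. During retrieval and ranking, the region information is used, according to the user's query, to rank the relevant results.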

16.
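Addressing similarity, the core problem in current e-commerce recommender systems, this paper proposes studying recommender systems with the help of Vague set theory. The uncertainty of customer behavior in e-commerce provides the theoretical basis for introducing Vague sets. Product recommendation depends on the degree of similarity between products or between customers, and similarity computation is one of the more mature areas of Vague set research. Based on common e-commerce shopping patterns, different customer types are identified; on the basis of this customer classification, statistical methods are used to define the Vague values of products, combining e-commerce recommender systems with Vague sets. The validity of the method is verified through similarity computation, offering new ideas and methods for recommender system research.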

17.
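A New Image Retrieval Algorithm for Digital Libraries   Total citations: 1, self-citations: 0, citations by others: 1
An image retrieval algorithm that combines visual features with high-level semantics and is adapted to the characteristics of libraries is proposed; a dynamic similarity measure equation is constructed through relevance feedback. Experimental results show that retrieval combining visual and semantic features achieves a higher retrieval rate than retrieval using visual features alone.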

18.
Most recent document standards like XML rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more complex document types. The design of such systems is still an open problem. We present a new model for structured document retrieval which allows computing scores of document parts. This model is based on Bayesian networks whose conditional probabilities are learnt from a labelled collection of structured documents, which is composed of documents, queries and their associated assessments. Training these models is a complex machine learning task and is not standard. This is the focus of the paper: we propose here to train the structured Bayesian Network model using a cross-entropy training criterion. Results are presented on the INEX corpus of XML documents.

19.
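An Empirical Study of Research Interest Similarity in Author Co-authorship Networks   Total citations: 2, self-citations: 0, citations by others: 2
[Purpose/Significance] Starting from the research interests of individual authors at the micro level, this study examines the keyword sets associated with authors in a co-authorship network to verify quantitatively that similar research interests are a motive for author collaboration. [Method/Process] Bibliographic records of literature in the retrieval field are collected from WOS (Web of Science), an author co-authorship network is constructed, and communities are partitioned with the Louvain algorithm. The Jaccard coefficient and the cosine similarity coefficient are implemented as measures, and the similarity of authors' research interests is computed and compared for the whole network and within communities. [Result/Conclusion] At the whole-network level, research interest similarity among authors in the co-authorship network is relatively high, but a certain proportion of difference, that is, complementarity, also exists. Within research communities, both the average research interest similarity and the complementarity of co-authors are higher than at the whole-network level, and the formation of research communities is influenced by authors' research interests. The interest similarity at both levels indicates that similar research interests are an important motive for author collaboration.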
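
The two measures named in the abstract above, the Jaccard coefficient and the cosine similarity coefficient, can be computed on author keyword sets roughly as follows. This Python sketch treats each author's keywords as an unweighted set, which is an assumption; the study may have weighted keywords by frequency. The function names and the sample keywords are hypothetical.

```python
import math
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine_sets(a: Iterable[str], b: Iterable[str]) -> float:
    """Cosine similarity of binary keyword vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


if __name__ == "__main__":
    # Hypothetical keyword sets for two co-authors.
    author_1 = {"information retrieval", "ranking", "xml"}
    author_2 = {"information retrieval", "xml", "clustering"}
    print(jaccard(author_1, author_2), cosine_sets(author_1, author_2))
```

One simple reading of the complementarity mentioned in the abstract is the share of keywords that two authors do not have in common, that is, one minus the Jaccard coefficient, although the abstract does not say how the study measured it.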

20.
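A Study of Similarity Computation Methods for XML Documents   Total citations: 1, self-citations: 0, citations by others: 1
XML (Extensible Markup Language) is becoming the standard for information exchange among applications on the Web. With the emergence of large amounts of semi-structured data in XML format, how to process and manage XML documents has become a research focus. Similarity computation for XML documents is an important topic in XML data processing and a key technology for XML document clustering and retrieval. An XML document consists of a logical structure and textual content, so the similarity between XML documents can be measured by structural features or by content features. This paper divides XML document similarity computation methods into two classes, structure-based methods and methods that combine structure with content, and compares and reviews the existing methods.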
