Similar articles
 Found 20 similar articles; search took 546 ms
1.
In this paper, we propose a new algorithm that incorporates the relationships of concept-based thesauri into document categorization using the k-NN classifier (k-NN). k-NN is one of the most popular document categorization methods because it performs relatively well in spite of its simplicity. However, its precision degrades significantly when ambiguity arises, i.e., when there exists more than one candidate category to which a document can be assigned. To remedy this drawback, we employ concept-based thesauri in the categorization. Employing the thesaurus entails structuring categories into hierarchies, since their structure needs to conform to that of the thesaurus in order to capture relationships between categories. By referencing the various relationships in the thesaurus that correspond to the structured categories, k-NN can be markedly improved by removing the ambiguity. We first perform document categorization using k-NN and then employ the relationships to reduce the ambiguity. Experimental results show that this method improves the precision of k-NN by up to 13.86% without compromising its recall.
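
The two-stage idea in this abstract (categorize with k-NN first, then consult thesaurus relationships only when the result is ambiguous) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' method: the function `classify`, the `margin` threshold that decides when two categories count as ambiguous, and the `related_terms` mapping, which stands in for the thesaurus by listing terms related to each category.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_scores(doc, training, k):
    """Sum the similarities of the k nearest training documents per category."""
    nearest = sorted(training, key=lambda item: cosine(doc, item[0]), reverse=True)[:k]
    scores = Counter()
    for vector, category in nearest:
        scores[category] += cosine(doc, vector)
    return scores


def classify(doc, training, related_terms, k=3, margin=0.1):
    """Stage 1: plain k-NN. Stage 2 (hypothetical): thesaurus tie-break."""
    ranked = knn_scores(doc, training, k).most_common()
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > margin:
        return ranked[0][0]  # unambiguous, keep the k-NN decision
    # Ambiguous: keep every category within `margin` of the best score and
    # prefer the one whose thesaurus-related terms occur most in the document.
    candidates = [cat for cat, score in ranked if ranked[0][1] - score <= margin]
    return max(candidates, key=lambda c: sum(doc[t] for t in related_terms.get(c, ())))


training = [
    (Counter("bank loan interest".split()), "finance"),
    (Counter("bank river water".split()), "geography"),
]
related_terms = {"finance": {"deposit", "loan"}, "geography": {"river", "lake"}}
print(classify(Counter("bank deposit".split()), training, related_terms, k=2))
```

In the usage lines, "bank deposit" is equally similar to both training documents, so the tie is resolved by the overlap with the hypothetical thesaurus terms rather than by k-NN alone.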

2.
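The Chinese Thesaurus (《汉语主题词表》) is a milestone in the history of information retrieval languages in China, and in the network era it will see new development and application. Addressing the current state of the Chinese Thesaurus, the article reviews the history of its compilation and revision and the important role it has played in information organization as an information retrieval language tool. The author proposes new ideas and strategies for how it can contribute to knowledge organization, how a new type of thesaurus suited to the computer environment can be built, and how to advance toward a term system for the network environment.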

3.
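Automatic identification of hierarchical relationships in thesauri based on clustering   Total citations: 3, self-citations: 0, citations by others: 3
杜慧平, 何琳. 《情报科学》, 2008, 28(11): 1680-1684
Identifying hierarchical relationships between words is one of the key difficulties in automatically constructing a thesaurus. The similarity-based word clustering method overcomes the limitations of the traditional practice of grouping hierarchically related words by their surface form: it works at the semantic level and can identify hierarchically related words that share no surface features. The method is introduced and tested, and the experimental results show that it is feasible to a certain extent.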

4.
Research on full-text retrieval   Total citations: 11, self-citations: 0, citations by others: 11
A new algorithm for the automatic segmentation of Chinese words is presented. It uses a stop word list and a post-controlled thesaurus, and draws on ideas from both the single-Chinese-character method and the thesaurus method. Based on this algorithm, a new full-text retrieval mode is built.

5.
6.
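In the Internet environment the volume of news is growing massively, so intelligent classification and knowledge extraction for news are urgently needed. On this basis, the paper studies how to discover new words starting from an existing keyword dictionary and add the extracted out-of-vocabulary words to the original lexicon, so as to build a keyword dictionary of appropriate size, broad coverage, and easy updating. Using a large-scale news corpus as the experimental resource, it adopts a method for recognizing new words other than proper names, which segments text with the N-gram algorithm and filters the candidates with a keyword extraction dictionary, a stop word dictionary, and similar resources. Evaluation of the experimental results shows that the method is simple and practical.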
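
The pipeline described above (cut text into N-grams, then filter with dictionaries) can be sketched as follows. This is an illustrative sketch only: the function name `candidate_new_words`, the character-level N-grams, the `min_count` frequency threshold, and the rule of rejecting any candidate that contains a stop word are assumptions, since the abstract does not give these details.

```python
from collections import Counter


def candidate_new_words(texts, known_words, stop_words, n_values=(2, 3), min_count=2):
    """Return {ngram: count} for frequent character N-grams not yet in the dictionary."""
    counts = Counter()
    for text in texts:
        for n in n_values:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1

    candidates = {}
    for gram, count in counts.items():
        if count < min_count:
            continue  # too rare to be a reliable candidate
        if gram in known_words:
            continue  # already in the keyword dictionary
        if any(stop in gram for stop in stop_words):
            continue  # contains a stop word such as a function character
        if not gram.isalpha():
            continue  # spans punctuation, digits, or whitespace
        candidates[gram] = count
    return candidates


found = candidate_new_words(
    ["区块链技术的发展", "区块链的应用"],
    known_words={"技术", "发展", "应用"},
    stop_words={"的"},
)
print(found)
```

Besides the full candidate "区块链", this simple filter also keeps its fragments "区块" and "块链"; a practical system would need a further step to discard substrings of longer candidates.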

7.
Decisions in thesaurus construction and use   Total citations: 1, self-citations: 0, citations by others: 1
A thesaurus and an ontology provide a set of structured terms, phrases, and metadata, often in a hierarchical arrangement, that may be used to index, search, and mine documents. We describe the decisions that should be made when including a term, deciding whether a term should be subdivided into its subclasses, or determining which of more than one set of possible subclasses should be used. Based on retrospective measurements or estimates of future performance when using thesaurus terms in document ordering, decisions are made so as to maximize performance. These decisions may be used in the automatic construction of a thesaurus. The evaluation of an existing thesaurus is described, consistent with the decision criteria developed here. These kinds of user-focused decision-theoretic techniques may be applied to other hierarchical applications, such as faceted classification systems used in information architecture or the use of hierarchical terms in “breadcrumb navigation”.

8.
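Applications of descriptors in the network environment   Total citations: 1, self-citations: 1, citations by others: 1
戴剑波. 《情报科学》, 2004, 22(4): 502-505
This paper describes three modes of applying descriptors in the network environment. First, direct indexing and retrieval with descriptors is very common in specialized websites and gateway retrieval systems. Second, because descriptors have clearly defined concepts and display inter-term relationships well, they can provide query expansion in keyword-based search engines. Third, different organizations generally use different thesauri or classification schemes for their own holdings and for information sources such as libraries; retrieving this information through a single subject approach in the network environment is a hot research topic in the information science community, and descriptors play an important role here.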

9.
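吕美香. 《情报科学》, 2012, (8): 1160-1166
Thesauri are the most important knowledge organization tools in libraries and information retrieval. The Chinese Classified Thesaurus (《中国分类主题词表》) is a traditional thesaurus whose updating and maintenance have always been done manually, which limits its use in digital libraries and the networked information environment. This paper presents a statistics-based method that extracts keywords from metadata titles and locates them in the thesaurus. It has roughly three steps: extract keywords from the titles; determine the specificity of the extracted keywords; and locate the highly specific technical terms in the thesaurus. Experiments on the Chinese Classified Thesaurus and on computer science and technology metadata provided by Shanghai Library show that the method is feasible. It can be applied to automatic indexing or cataloging, and has practical value and broad application prospects.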

10.
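雷晓, 常春, 刘伟. 《情报科学》, 2021, 39(1): 135-141
[Purpose/Significance] To keep the term coverage of a thesaurus complete, new terms that have emerged in a domain but are not yet included must be added promptly. Combining the time and document term frequency features of candidate words, and exploring the distribution of new terms from a time-series perspective to guide their selection, is a problem worth studying. [Method/Process] The article studies the time series of the document term frequencies of words. The time series are preprocessed by term frequency normalization and alignment to equal length, and the k-means clustering algorithm is then used to cluster candidate words by the trend of their time series. The regularities in the trends of terms and non-terms are explored, and the trend characteristics that a new term should satisfy are summarized. [Result/Conclusion] The clustering study finds that new terms generally show a growing trend. In the empirical study, candidate words in a growth state were selected; expert judgment shows that the method can effectively pick out, from the candidates, the new terms that can be added to the thesaurus, with relatively high accuracy. [Innovation/Limitation] The innovation is that both temporal change and document term frequency are considered when selecting new thesaurus terms. Limited by the scale of data processing, the empirical study counted only the frequencies of paper keywords.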

11.
In this paper, we provide a new insight into clustering with spring–mass dynamics and propose a resulting hierarchical clustering algorithm. To realize spectral graph partitioning as clustering, we model a weighted graph of a data set as a mass–spring dynamical system, where we regard a cluster as an oscillating single entity of a data set with similar properties. We then describe how oscillation modes are related to the eigenvectors of a graph Laplacian matrix of the data set. In each step of the clustering, we select the group of clusters that has the largest number of constituent clusters. This group is divided into sub-clusters by examining an eigenvector that minimizes a cost function, which is formed in such a way that the subdivided clusters will be balanced and large. To find k clusters in non-spherical or complex data, we first transform the data into spherical clusters located on the unit sphere in (k−1)-dimensional space. We then apply the previous procedure to the transformed data. The computational experiments demonstrate that the proposed method works quite well on a variety of data sets, although its performance degrades with the degree of overlap in the data sets.

12.
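米佳. 《现代情报》, 2009, 29(1): 38-41
This paper gives a comprehensive discussion of converting thesauri into ontologies and proposes a concept-based thesaurus conversion method that yields an RDF/OWL description of the thesaurus.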

13.
The disambiguation of abbreviations is a crucial step in medical knowledge organization. In the past, most scholars have focused on the problem of disambiguating medical abbreviations in single sentences; they have not systematically considered full-article abbreviation disambiguation tasks. In this work, we present a research framework for full-article medical abbreviation disambiguation (FMADRF) based on the structural characteristics of abbreviation–definition pairs in a full scientific medical article. Our method uses contextual semantic information, external linguistic features, and the mapping relationships and structural similarities between abbreviations and their expansions. The model takes a four-pronged approach: identification of abbreviations and abbreviation–definition pairs, and alignment and complementation of abbreviations and abbreviation expansions. The results show that our novel BBF-BLC-R model improves the recognition and modification of abbreviation–definition pairs, achieving the best F1 score of 91.83%. Furthermore, our new strategy combines semantic and structural information to significantly improve term alignment, with an F1 score of 97.11%. In our test, a thesaurus of abbreviations and their expansions was constructed from 13,472 full-text medical articles, resulting in 14,742 abbreviations with 31,327 corresponding expansions. This work enhances the semantic association of terms in full medical texts, eliminating the problems of “rich” semantics and association–relation roadblocks caused by term misalignments. It further provides technical and methodological support for the organization of medical knowledge, facilitating deep knowledge mining of full-text medical articles.

14.
15.
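Semi-automatic construction of a domain ontology based on an aerospace thesaurus   Total citations: 2, self-citations: 0, citations by others: 2
Building on thesaurus-based ontology construction methods, and starting from a review of the current state of ontology construction with such methods, the article addresses a series of problems in converting a thesaurus into a domain ontology, such as representing some uncertain relationships between thesaurus terms, refining the OWL relationship representation during construction, and maintaining and extending the ontology after conversion. It discusses the relevant background on ontologies and thesauri, uses the OWL language to represent and describe the descriptors of the thesaurus and the relationships between them, and proposes a theoretical and practical method for converting a thesaurus into an ontology.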

16.
We present a 3-staged method for automated learning of the spatial density function of the mass of all gravitating matter in a real galaxy, for which data exist on the observable phase space coordinates of a sample of resident galactic particles that trace the galactic gravitational potential. We learn this gravitational mass density function by embedding it in the domain of the probability density function (pdf) of the phase space vector variable, where we learn this pdf as well, given the data. We generate values of each sought function at a design value of its input, to learn vectorised versions of each function; this creates the training data, using which we undertake supervised learning of each function, to thereafter undertake predictions and forecasting of the functional value at test inputs. We assume that the phase space that a kinematic data set is sampled from is isotropic, and we quantify the relative violation of this assumption in a given data set. The method is illustrated on the real elliptical galaxy NGC4649. The purpose of this learning is to produce a data-driven protocol that allows for computation of dark matter content in any example real galaxy, without relying on system-specific astronomical details, while undertaking objective quantification of support in the data for undertaken model assumptions.

17.
Authors and searchers usually express the same things in many different ways, which causes problems in free text searching of text databases. Thus, a switching tool connecting the different names of one concept is needed. This study tests the effectiveness of a thesaurus as a search-aid in free text searching of a full text database. A set of queries was searched against a large full text database of newspaper articles. The search-aid thesaurus constructed for the test contains the usual relationships of a thesaurus, namely equivalence, hierarchical, and associative relationships. Each query was searched in five distinct modes: basic search, synonym search, narrower term search, related term search, and union of all previous searches. The basic searches contained only terms included in the original query statements. In the synonym searches, the terms of the basic search were extended by disjunction of the synonyms given by the search-aid thesaurus without modifying the overall logic of the basic search. Likewise, the basic search was extended in turn with the narrower terms and with the related terms given by the search-aid thesaurus. The last search mode included the basic terms and all the terms used in the previous searches. The searches were analyzed in terms of relative recall and precision; relative recall was estimated by setting the recall of the union search to 100%. On average, the value of relative recall was 47.2% in the basic search, compared with 100% in the union search; the average value of precision decreased only from 62.5% in the basic search to 51.2% in the union search.
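
The search modes and the relative recall measure described above can be sketched in a few lines. The data structures are assumptions made for illustration: a query is modelled as a list of terms combined with AND, each term is widened into an OR-group of thesaurus terms (so the overall logic of the basic search is unchanged), and `index` is a hypothetical inverted index from term to document ids.

```python
def expand_query(query, thesaurus, relations):
    """Turn each query term into an OR-group extended by the given relations."""
    groups = []
    for term in query:
        group = {term}
        for relation in relations:
            group |= set(thesaurus.get(term, {}).get(relation, ()))
        groups.append(group)
    return groups


def search(groups, index):
    """AND across groups, OR within a group."""
    result = None
    for group in groups:
        hits = set()
        for term in group:
            hits |= index.get(term, set())
        result = hits if result is None else result & hits
    return result or set()


def evaluate(retrieved, relevant, union_retrieved):
    """Precision, and recall relative to the union search (its recall is set to 100%)."""
    found = retrieved & relevant
    baseline = union_retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    relative_recall = len(found) / len(baseline) if baseline else 0.0
    return precision, relative_recall


# Hypothetical toy data: inverted index, thesaurus entry, and relevance judgments.
index = {"car": {1, 2}, "automobile": {3}, "sedan": {4}, "garage": {5},
         "price": {1, 3, 4, 5, 6}}
thesaurus = {"car": {"synonym": ["automobile"], "narrower": ["sedan"],
                     "related": ["garage"]}}
query = ["car", "price"]
relevant = {1, 3, 4}

basic = search(expand_query(query, thesaurus, []), index)
union = search(expand_query(query, thesaurus, ["synonym", "narrower", "related"]), index)
print(evaluate(basic, relevant, union), evaluate(union, relevant, union))
```

On this toy data the union search retrieves more relevant documents at the cost of one non-relevant hit, which is the same trade-off the study measures: higher relative recall with a moderate drop in precision.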

18.
Many traditional works on off-line Thai handwritten character recognition used a set of local features including circles, concavity, endpoints and lines to recognize hand-printed characters. However, in natural handwriting, these local features are often missing due to rough or quick writing, resulting in dramatic reduction of recognition accuracy. Instead of using such local features, this paper presents a method called multi-directional island-based projection to extract global features from handwritten characters. As the recognition model, two statistical approaches, namely interpolated n-gram model (n-gram) and hidden Markov model (HMM), are proposed. The experimental results indicate that the proposed scheme achieves high accuracy in the recognition of naturally-written Thai characters with numerous variations, compared to some common previous feature extraction techniques. Another experiment with English characters also displays quite promising results.

19.
20.
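Ontology conversion makes it possible to bring thesauri onto the network. This paper describes the classification of the OWL language, the methods for describing a networked thesaurus in OWL, and the process and examples of converting a thesaurus into an ontology, providing a technical reference for the networked and intelligent development of thesauri.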
