Similar Documents
20 similar documents found (search time: 31 ms)
1.
An Ontology-Based Scheme for Automatic Full-Text Indexing   Total citations: 5 (self-citations: 1, other citations: 5)
Wang Taisen (王泰森), Information Science (《情报科学》), 2003, 21(9): 950-952
To improve full-text retrieval precision in digital libraries, this paper proposes an ontology-based scheme for automatic full-text indexing. The scheme uses ontological methods to emphasize the inherent conceptual relationships between words. It addresses the problems of traditional manual indexing: incomplete coverage of the full text, a lack of conceptual links between index terms, difficulty reflecting the full subject content of a document, and missed results or overly large result sets caused by polysemy and synonymy, which defeat the purpose of retrieval. It also enables automated operation when digital libraries perform subject indexing.

2.
[Purpose/Significance] Using text-mining techniques, this study automatically discovers more representative subject terms from document content, locates those terms within specific chapters, and applies visualization techniques for subject indexing, helping readers discover latent relationships among document topics intuitively and efficiently. [Method/Process] Subject terms are mined at the document-content level with text-mining techniques, and the resulting information is presented with visualization tools. On this basis, a visual automatic subject-indexing system is built, and its indexing performance is validated on several topics in the Gesar domain. [Result/Conclusion] The results show that the method achieves content-level visual automatic subject indexing in the Gesar domain, locating chapters, paragraphs, and sentences quickly and accurately. The indexing process is visual and interactive, improving user experience and engagement. The system is validated using Hero Gesar (《英雄格萨尔》) as an example, but the method itself has no domain restriction and can be applied to documents in other fields.

3.
We have developed methods for storing and retrieving large dictionaries of word pairs and other multi-word phrases based on hashed indexing. From analysis of text samples we have derived Zipfian laws for the frequency distributions of word pairs and longer phrases. We show where these Zipfian curves cross and deduce that the number of multi-word phrases which occur frequently in text is surprisingly small, of the same order of magnitude as the number of individual word-types in a text. Dictionaries of phrases are therefore amenable to fast processing with modest computer equipment. Finally, we suggest that in stylistic analysis word phrases might better discriminate between authors than do single words.
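As a minimal illustration of the phrase statistics involved (the hashed-dictionary storage and curve fitting are omitted), adjacent word pairs can be counted and ranked by frequency; plotting rank against frequency on log-log axes yields Zipfian curves like those discussed above:

```python
from collections import Counter

def bigram_counts(text):
    # Count adjacent word pairs (bigrams) in a whitespace-tokenized text.
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

counts = bigram_counts("the cat sat on the mat and the cat slept")
ranked = counts.most_common()  # rank-frequency list, most frequent pair first
```

With real corpora, the same counting extends directly to longer phrases by zipping more offsets.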

4.
In Korean text, foreign words, which are mostly transliterations of English words, are frequently used. Foreign words are usually very important index terms in Korean information retrieval, since most of them are technical terms or names, so accurate foreign word extraction is crucial for high retrieval performance. However, accurate foreign word extraction is not easy because it inevitably involves word segmentation, and most foreign words are unknown. In this paper, we present an effective foreign word recognition and extraction method. In order to accurately extract foreign words, we developed an effective word segmentation method that handles unknown foreign words. Our word segmentation method effectively utilizes both unknown-word information acquired through automatic dictionary compilation and foreign word recognition information. Our HMM-based foreign word recognition method does not require a large set of labeled examples for model training, unlike previously proposed methods.
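The word-segmentation step at the heart of such extraction can be sketched with a simple dictionary-driven dynamic program (the HMM recognizer and unknown-word handling are omitted; the toy vocabulary is hypothetical):

```python
def segment(text, vocab):
    # Split `text` into the fewest in-vocabulary chunks by dynamic
    # programming; returns None when no full segmentation exists.
    n = len(text)
    best = [None] * (n + 1)  # best[i] = best segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                cand = best[j] + [text[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

parts = segment("penapple", {"pen", "apple"})
```

A real system would add a fallback score for out-of-vocabulary spans instead of failing outright.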

5.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. This method is mainly based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the containing document. The weight is based on how widely the word is spread across the document, not only on its rate of occurrence. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples, obtaining an average recall of 46% and an average precision of 64%.
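A hedged sketch of the weighting idea (the abstract does not give the exact formula; this version simply multiplies term frequency by the fraction of document segments containing the word):

```python
def spread_weight(word, segments):
    # Hypothetical weight: term frequency scaled by how widely the
    # word is spread across the document's segments.
    tf = sum(seg.count(word) for seg in segments)
    spread = sum(1 for seg in segments if word in seg) / len(segments)
    return tf * spread

# "data" occurs once in every segment; "index" only in the first.
doc = [["data", "index"], ["data", "word"], ["data"]]
</imports>```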

6.
[Purpose/Significance] Addressing the main problems and weak links in constructing technology-effect maps, this paper proposes a construction method based on SAO structures and word vectors. [Method/Process] A Python program extracts SAO structures from patent abstracts, from which technology terms and effect terms are identified. Combining a domain dictionary with a patent-domain corpus, semantic similarity between terms is computed using Word2Vec and WordNet; topics are indexed automatically with a network-based topic clustering algorithm; and the technology-effect matrix is built from co-occurrence relations in the SAO structures. [Result/Conclusion] The method achieves automatic construction of technology-effect maps based on SAO structures and word vectors, improves the soundness of the technology-effect topics and the accuracy of patent classification labelling, and offers a new approach to automated construction of technology-effect maps.
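The semantic-similarity step rests on comparing word vectors; with Word2Vec this is typically cosine similarity, which can be sketched as:

```python
import math

def cosine(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

In practice the vectors would come from a trained Word2Vec model rather than being written by hand.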

7.
This paper reviews the state of the art in automatic indexing and applies it to the indexing of government information disclosure. Addressing the problems and shortcomings in current government information disclosure work, it combines word-frequency statistics, position weighting, and word co-occurrence statistics in a statistical weighting algorithm, and designs and implements keyword-based automatic indexing for government information disclosure.
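A minimal sketch of combining word frequency with a position bonus (the co-occurrence component is omitted, and the coefficients `alpha` and `beta` are hypothetical, not from the paper):

```python
from collections import Counter

def keyword_scores(sentences, title_words, alpha=1.0, beta=0.5):
    # Hypothetical combined weighting: term frequency plus a fixed
    # position bonus for words that also occur in the title.
    tf = Counter(w for s in sentences for w in s)
    return {w: alpha * tf[w] + (beta if w in title_words else 0.0) for w in tf}

scores = keyword_scores([["gov", "info"], ["info", "index"]], {"info"})
```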

8.
This study attempted to use semantic relations expressed in text, in particular cause-effect relations, to improve information retrieval effectiveness. The study investigated whether the information obtained by matching cause-effect relations expressed in documents with the cause-effect relations expressed in users’ queries can be used to improve document retrieval results, in comparison to using just keyword matching without considering relations. An automatic method for identifying and extracting cause-effect information in Wall Street Journal text was developed. Causal relation matching was found to yield a small but significant improvement in retrieval results when the weights used for combining the scores from different types of matching were customized for each query. Causal relation matching did not perform better than word proximity matching (i.e. matching pairs of causally related words in the query with pairs of words that co-occur within document sentences), but the best results were obtained when causal relation matching was combined with word proximity matching. The best kind of causal relation matching was found to be one in which one member of the causal relation (either the cause or the effect) was represented as a wildcard that could match any word.
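The best-performing matching variant, in which one member of the relation is a wildcard, can be sketched as follows (the relation-extraction step itself is omitted):

```python
def causal_match(query_rel, doc_rels):
    # Match a (cause, effect) query pair against document pairs;
    # "*" acts as a wildcard on either side of the relation.
    qc, qe = query_rel
    return [(c, e) for c, e in doc_rels
            if qc in ("*", c) and qe in ("*", e)]

rels = [("smoking", "cancer"), ("rain", "flood")]
```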

9.
Compact graphic display of phrases from the original text is among abstracting assistance features being prototyped in the TEXNET text network management system. Compaction is achieved by embedding subphrases and by enabling the user to select rapidly word by word. Phrases displayed would not necessarily be those selected for automatic indexing.

10.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and exploiting the enormous amount of information that exists in unstructured textual form, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words, in which terms are cut off from their finer context, i.e. their position in a sentence or document; only the broader document context is used, with some form of term-frequency information in the vector space. Consequently, the semantics of a word that can be inferred from its position in a sentence and its relations with neighboring words are usually ignored. Yet the meanings of words and the semantic connections among words, documents, and even classes clearly matter, since methods that capture semantics generally reach better classification performance. Several surveys have analyzed diverse approaches to traditional text classification, and most of them cover the application of semantic term-relatedness methods to some degree, but they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic versus traditional text classification. This survey explores past and recent advances in semantic text classification and organizes existing approaches into five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, it highlights the advantages of semantic text classification algorithms over traditional text classification algorithms.
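The context loss in the traditional bag-of-words representation is easy to demonstrate: two sentences with opposite meanings map to identical term-count vectors.

```python
from collections import Counter

def bag_of_words(doc):
    # Traditional representation: term counts only; word order is discarded.
    return Counter(doc.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")  # same bag, opposite meaning
```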

11.
A suite of computer programs has been developed for representing the full text of lengthy documents in vector form and classifying them by a clustering method. The programs have been applied to the full text of the Conventions and Agreements of the Council of Europe which consist of some 280,000 words in the English version and a similar number in the French. Results of the clustering experiments are presented in the form of dendrograms (tree diagrams) using both the treaty and article as the clustering unit. The conclusion is that vector techniques based on the full text provide an effective method of classifying legal documents.

12.
A procedure for automated indexing of pathology diagnostic reports at the National Institutes of Health is described. Diagnostic statements in medical English are encoded by computer into the Systematized Nomenclature of Pathology (SNOP). SNOP is a structured indexing language constructed by pathologists for manual indexing. It is of interest that effective automatic encoding can be based upon an existing vocabulary and code designed for manual methods. Morphosyntactic analysis, a simple syntax analysis, matching of dictionary entries consisting of several words, and synonym substitutions are techniques utilized.
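A hedged sketch of the matching techniques named above: synonym substitution followed by longest-first matching of multi-word dictionary entries (the entries and codes here are illustrative, not actual SNOP content):

```python
def encode(tokens, dictionary, synonyms):
    # Substitute synonyms, then greedily match the longest possible
    # multi-word dictionary entry at each position.
    tokens = [synonyms.get(t, t) for t in tokens]
    codes, i = [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):  # try longest phrase first
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                codes.append(dictionary[phrase])
                i += n
                break
        else:
            i += 1  # skip tokens with no code
    return codes

snop = {"squamous cell carcinoma": "M-8070", "lung": "T-28"}
codes = encode("pulmonary squamous cell carcinoma".split(), snop,
               {"pulmonary": "lung"})
```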

13.
Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns a similarity matrix between word pairs from the unlabeled corpus and uses vector representations of the MRD's sense definitions, which are derived from that similarity matrix. To disambiguate all occurrences of polysemous words in a sentence, the system separately constructs an acyclic weighted digraph (AWD) for each occurrence. The AWD is structured according to the senses of the context words that occur with the target word in the sentence. After building the AWD for each polysemous word, we search for the optimal path through it using the Viterbi algorithm and assign to the target word the sense lying on that optimal path. In experiments, our system achieves 76.4% accuracy on semantically ambiguous Korean words.
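The optimal-path search over sense candidates can be sketched with a generic Viterbi pass (the AWD construction and the actual edge weights, derived from the similarity matrix, are replaced here by a caller-supplied score function; the sense labels are invented):

```python
def viterbi(senses_per_word, score):
    # Pick one sense per word so that the sum of pairwise edge scores
    # along the path is maximal (a minimal stand-in for the AWD search).
    prev = {s: (0.0, [s]) for s in senses_per_word[0]}
    for layer in senses_per_word[1:]:
        cur = {}
        for s in layer:
            p, (ps, path) = max(
                ((t, prev[t]) for t in prev),
                key=lambda kv: kv[1][0] + score(kv[0], s),
            )
            cur[s] = (ps + score(p, s), path + [s])
        prev = cur
    return max(prev.values(), key=lambda v: v[0])[1]

senses = [["bank_river", "bank_money"], ["water", "loan"]]
edge = {("bank_river", "water"): 2.0, ("bank_money", "loan"): 1.0}
best_path = viterbi(senses, lambda a, b: edge.get((a, b), 0.0))
```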

15.
Latent semantic indexing is an unsupervised learning method that can automatically learn data for morphological analysis from raw, unprocessed text, improving learning performance by computing semantic relatedness between words. This paper first briefly reviews the concepts of morphological analysis and morphology learning along with earlier morphology-learning approaches, then describes a morphology-learning method based on this theory, followed by several improvements to and evaluations of the method, and closes with conclusions and an outlook.

16.
Recently, sentiment classification has received considerable attention within the natural language processing research community. However, since most recent works regarding sentiment classification have been done in the English language, there are accordingly not enough sentiment resources in other languages. Manual construction of reliable sentiment resources is a very difficult and time-consuming task. Cross-lingual sentiment classification aims to utilize annotated sentiment resources in one language (typically English) for sentiment classification of text documents in another language. Most existing research works rely on automatic machine translation services to directly project information from one language to another. However, different term distribution between original and translated text documents and translation errors are two main problems faced in the case of using only machine translation. To overcome these problems, we propose a novel learning model based on active learning and semi-supervised co-training to incorporate unlabelled data from the target language into the learning process in a bi-view framework. This model attempts to enrich training data by adding the most confident automatically-labelled examples, as well as a few of the most informative manually-labelled examples from unlabelled data in an iterative process. Further, in this model, we consider the density of unlabelled data so as to select more representative unlabelled examples in order to avoid outlier selection in active learning. The proposed model was applied to book review datasets in three different languages. Experiments showed that our model can effectively improve the cross-lingual sentiment classification performance and reduce labelling efforts in comparison with some baseline methods.

17.
Lai Juan (赖娟), Bulletin of Science and Technology (《科技通报》), 2012, 28(2): 152-154
This paper studies the automatic classification of Chinese words. To address the low classification accuracy of traditional approaches, it applies an ant colony algorithm to automatic Chinese word classification. The method first computes statistics over a large-scale text corpus to obtain unigram and bigram information about words, then uses an ant colony algorithm over this information to classify the words. Experimental results show that the proposed algorithm effectively improves word-classification accuracy.

18.
Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications because a word often conveys complex meanings decomposable into several morphemes (i.e. prefix, stem, suffix). By segmenting words into morphemes, we can improve the extraction of English/Arabic translation pairs from parallel texts. This paper describes two algorithms, and their combination, for automatically extracting an English/Arabic bilingual dictionary from parallel texts in the Internet archive, using an Arabic light stemmer as a preprocessing step. Before applying the Arabic light stemmer, total system precision and recall were 88.6% and 81.5% respectively; after applying it to the Arabic documents, precision and recall increased to 91.6% and 82.6% respectively.
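A light stemmer of the kind used as the preprocessing step strips a small set of common affixes without full morphological analysis. A minimal sketch (the affix lists are illustrative, not the paper's actual stemmer):

```python
def light_stem(word, prefixes=("وال", "بال", "ال"), suffixes=("ات", "ين", "ة")):
    # Strip at most one common prefix and one common suffix, keeping
    # at least three characters of stem (a typical light-stemming rule).
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```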

19.
Today, due to the vast amount of textual data, automated extractive text summarization is one of the most common and practical techniques for organizing information. Extractive summarization selects the most appropriate sentences from a text and provides a representative summary. Sentences, as individual textual units, are usually too short for major text processing techniques to perform well on, so it seems vital to bridge the gap between short text units and conventional text processing methods. In this study, we propose a semantic method for implementing an extractive multi-document summarizer system using a combination of statistical, machine learning-based, and graph-based methods. It is a language-independent and unsupervised system. The proposed framework learns semantic representations of words from the given documents via the word2vec method. It expands each sentence, through an innovative method, with the most informative and least redundant words related to the sentence's main topic. Sentence expansion implicitly performs word sense disambiguation and tunes the conceptual density toward the central topic of each sentence. The framework then estimates the importance of sentences using a graph representation of the documents. To identify the most important topics of the documents, we propose an inventive clustering approach that autonomously determines the number of clusters and their initial centroids, and clusters sentences accordingly. The system selects the best sentences from appropriate clusters for the final summary with respect to information salience, minimum redundancy, and adequate coverage. A set of extensive experiments on the DUC2002 and DUC2006 datasets was conducted to investigate the proposed scheme. Experimental results showed that the proposed sentence expansion algorithm and clustering approach could considerably enhance the performance of the summarization system. Comparative experiments also demonstrated that the proposed framework outperforms most state-of-the-art summarizer systems and can impressively assist the task of extractive text summarization.
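The graph-based importance estimate can be sketched with a simple similarity-sum centrality (the word2vec expansion and clustering steps are omitted, and Jaccard word overlap stands in for the real similarity measure):

```python
def jaccard(a, b):
    # Word-overlap similarity between two sentences.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def rank_sentences(sents):
    # Score each sentence by its total similarity to the others
    # (a degree-centrality stand-in for a full graph ranking).
    scores = [sum(jaccard(a, b) for b in sents if b is not a) for a in sents]
    return [s for _, s in sorted(zip(scores, sents), reverse=True)]

ranked = rank_sentences(["the cat sat", "the cat ran", "dogs bark loudly"])
```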

20.
Documents in computer-readable form can be used to provide information about other documents, i.e. those they cite. To do this efficiently requires procedures for computer recognition of citing statements, which is not easy, especially for multi-sentence citing statements. Computer recognition procedures have been developed that are accurate to the following extent: 73% of the words in statements selected by the computer procedures as citing statements are words correctly attributable to the corresponding documents. The retrieval effectiveness of computer-recognized citing statements was tested as follows. First, for eight retrieval requests in inorganic chemistry, average recall by search of Chemical Abstracts Service indexing and Chemical Abstracts abstract text words was found to be 50%. Words from citing statements referring to the papers to be retrieved were then added to the index terms and abstract words as additional access points, and the search was repeated; average recall increased to 70%. Only words from citing statements published within a year of the cited papers were used. Using citing-statement words alone (published within a year), without index or abstract terms, average recall was 40%; when the words of the titles of the cited papers were added to those citing-statement words, average recall increased to 50%.
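The recall figures quoted above follow the standard definition, which for completeness is:

```python
def recall(retrieved, relevant):
    # Fraction of the relevant documents that were actually retrieved.
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

r = recall({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"})  # 2 of 4 relevant found
```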


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)  京ICP备09084417号