Similar Literature
20 similar documents found (search time: 31 ms)
1.
The fundamental idea of the work reported here is to extract index phrases from texts with the help of a single-word concept dictionary and a thesaurus containing relations among concepts. The work is based on the fact that, within every phrase, the single words the phrase is composed of are related in a certain well-defined manner, the type of relation holding between concepts depending only on the concepts themselves. Relations can therefore be stored in a semantic network. The algorithm described extracts single-word concepts from texts and combines them into phrases using the semantic relations between these concepts that are stored in the network. The results obtained show that phrase extraction from texts by this semantic method is possible and offers many advantages over other (purely syntactic or statistical) methods concerning the precision and completeness of the meaning representation of the text. The results also show, however, that some syntactic and morphological "filtering" should be included for efficiency reasons.
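To make the idea concrete, here is a minimal Python sketch of relation-driven phrase extraction: a toy semantic network licenses a phrase whenever two concepts found in the text stand in a stored relation. The network entries, relation labels, and example text are invented for illustration and are not the authors' dictionary or thesaurus.

```python
# Minimal sketch of relation-driven phrase extraction; the tiny
# "semantic network" and the example text are invented for illustration.
from itertools import permutations

# Typed relations between single-word concepts (the semantic network).
NETWORK = {
    ("information", "retrieval"): "OBJECT_OF",
    ("index", "phrase"): "ATTRIBUTE_OF",
    ("concept", "dictionary"): "CONTAINED_IN",
}

CONCEPTS = {w for pair in NETWORK for w in pair}

def extract_phrases(text):
    """Combine concepts found in the text into phrases whenever
    the network records a relation between them."""
    tokens = [t.strip(".,").lower() for t in text.split()]
    found = [t for t in tokens if t in CONCEPTS]
    phrases = []
    for a, b in permutations(found, 2):
        rel = NETWORK.get((a, b))
        if rel:
            phrases.append((f"{a} {b}", rel))
    return phrases

print(extract_phrases("A concept dictionary supports information retrieval."))
# -> [('concept dictionary', 'CONTAINED_IN'), ('information retrieval', 'OBJECT_OF')]
```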

2.
Semantic representation reflects the meaning of a text as it may be understood by humans, and thus facilitates various automated language processing applications. Although semantic representation is very useful for several applications, only a few models have been proposed for the Arabic language. In that context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts are employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined Arabic linguistic rules. The core idea of the proposed model is to represent the meaning of an Arabic sentence as a rooted acyclic graph. The textual entailment recognition challenge is used to evaluate the ability of the proposed model to enhance other Arabic NLP applications. The experiments were conducted on a benchmark Arabic textual entailment dataset, namely ArbTED. The results show that the proposed graph-based model is able to enhance the performance of the textual entailment recognition task in comparison to baseline models. On average, the proposed model achieved 8.6%, 30.2%, 5.3% and 16.2% improvement in terms of accuracy, recall, precision, and F-score, respectively.
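A minimal sketch of the core idea, representing a sentence's meaning as a rooted acyclic graph; it uses networkx, and the example sentence, relation labels, and attributes are invented placeholders (shown in English for readability) rather than the authors' actual Arabic schema.

```python
# Sketch of a sentence meaning as a rooted acyclic graph, in the spirit
# of the paper's model; labels and example are invented placeholders.
import networkx as nx

g = nx.DiGraph()
g.add_node("read", pos="VERB", root=True)          # the root predicate
g.add_edge("read", "student", rel="agent")         # who performs the action
g.add_edge("read", "book", rel="object")           # what is acted upon
g.add_edge("book", "new", rel="modifier")          # attribute of the object

assert nx.is_directed_acyclic_graph(g)             # representation must stay acyclic

for u, v, d in g.edges(data=True):
    print(f"{u} --{d['rel']}--> {v}")
```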

3.
Cross-language plagiarism detection aims to detect plagiarised fragments of text among documents in different languages. In this paper, we perform a systematic examination of Cross-language Knowledge Graph Analysis; an approach that represents text fragments using knowledge graphs as a language independent content model. We analyse the contributions to cross-language plagiarism detection of the different aspects covered by knowledge graphs: word sense disambiguation, vocabulary expansion, and representation by similarities with a collection of concepts. In addition, we study both the relevance of concepts and their relations when detecting plagiarism. Finally, as a key component of the knowledge graph construction, we present a new weighting scheme of relations between concepts based on distributed representations of concepts. Experimental results in Spanish–English and German–English plagiarism detection show state-of-the-art performance and provide interesting insights on the use of knowledge graphs.
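The final contribution can be illustrated with a small sketch: weight an edge between two concepts by the cosine similarity of their distributed representations. The 4-dimensional vectors below are toy values, not real concept embeddings.

```python
# Sketch of weighting a knowledge-graph relation by the similarity of the
# two concepts' distributed representations; vectors are toy values.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

embeddings = {
    "plagiarism": np.array([0.9, 0.1, 0.3, 0.0]),
    "copying":    np.array([0.8, 0.2, 0.4, 0.1]),
    "weather":    np.array([0.0, 0.9, 0.1, 0.7]),
}

def relation_weight(c1, c2):
    """Weight an edge of the knowledge graph by embedding similarity."""
    return cosine(embeddings[c1], embeddings[c2])

print(relation_weight("plagiarism", "copying"))  # high weight
print(relation_weight("plagiarism", "weather"))  # low weight
```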

4.
Automatic text summarization attempts to provide an effective solution to today's unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single- and multi-document summarization. The summarizer benefits from two well-established text semantic representation techniques, Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA), as well as the constantly evolving collective human knowledge in Wikipedia. SRL is used to parse sentences semantically, with word tokens represented as vectors of weighted Wikipedia concepts using the ESA method. The essence of the developed framework is to construct a unique concept-graph representation underpinned by semantic role-based multi-node (sub-sentence level) vertices for summarization. We empirically evaluated the summarization system using the standard publicly available dataset from the Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all state-of-the-art comparators on single-document summarization by the ROUGE-1 and ROUGE-2 measures, while ranking second in the ROUGE-1 and ROUGE-SU4 scores for multi-document summarization. The testing also demonstrates the scalability of the system: varying the size of the evaluation data has little impact on summarizer performance, particularly for the single-document task. In a nutshell, the findings demonstrate the power of role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.
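A rough sketch of the ESA side of the pipeline: a text fragment is represented by its similarities to a collection of concept articles. The three toy "articles" stand in for Wikipedia concepts; this is not the authors' SRL-based summarizer.

```python
# Minimal ESA-style sketch: a sentence becomes a vector of similarities to
# "concept" articles. The toy articles stand in for Wikipedia concepts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concept_articles = {
    "Machine learning": "algorithms learn patterns from data and improve with experience",
    "Summarization": "producing a short summary that preserves the key content of a text",
    "Football": "a team sport played with a ball between two teams of eleven players",
}

vec = TfidfVectorizer()
concept_matrix = vec.fit_transform(concept_articles.values())

def esa_vector(sentence):
    """Weighted concept vector: one similarity score per concept article."""
    sims = cosine_similarity(vec.transform([sentence]), concept_matrix)[0]
    return dict(zip(concept_articles, sims))

print(esa_vector("algorithms learn a summary from data"))
```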

5.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as tweets, which often contain language irregularity and noise and do not necessarily contain as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN) model, combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using the CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying the RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance against noisy texts, and vice versa. In contrast, our evaluation has shown that the proposed DeepParaphrase-based approach achieves good results on both types of texts, thus making it more robust and generic than the existing approaches.
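As a hedged illustration of the coarse-grained CNN + RNN modelling (not the actual DeepParaphrase architecture), the PyTorch sketch below encodes a sentence with a convolution over embeddings followed by a GRU, then compares two sentence vectors by cosine similarity; all layer sizes and the random toy inputs are invented.

```python
# Minimal CNN + RNN sentence encoder in the spirit of the paper; the
# dimensions and pairing logic are invented for illustration.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, filters=32, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        # CNN captures local n-gram features (kernel_size=3 -> trigrams).
        self.conv = nn.Conv1d(emb, filters, kernel_size=3, padding=1)
        # RNN (a GRU here) captures longer-range dependencies.
        self.rnn = nn.GRU(filters, hidden, batch_first=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)    # (batch, emb, seq_len)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, h = self.rnn(x)                         # h: (1, batch, hidden)
        return h.squeeze(0)                        # sentence vector

enc = SentenceEncoder()
s1 = torch.randint(0, 1000, (1, 12))               # two toy token-id sentences
s2 = torch.randint(0, 1000, (1, 12))
sim = torch.cosine_similarity(enc(s1), enc(s2))    # paraphrase-score proxy
print(sim.item())
```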

6.
左晓飞, 刘怀亮, 范云杰, 赵辉. 《情报杂志》, 2012, 31(5): 180-184, 191
Traditional keyword-based text clustering algorithms have difficulty exploiting the semantic features of text, so their clustering results leave much to be desired. This paper proposes the notion of a concept semantic field and gives an algorithm for constructing concept semantic fields from HowNet. First, HowNet is used to build a sememe masking layer that screens out sememes with weak descriptive power. Then, based on an analysis of HowNet's structure, rules for extracting related concepts are given, together with construction methods for simple and complex concept semantic fields. Finally, a text clustering algorithm based on concept semantic fields is presented. The algorithm can fully exploit the semantic relations between feature words and also handles irregularly shaped clusters well. Experiments show that the algorithm effectively improves clustering quality.
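A toy sketch of the sememe intuition behind concept semantic fields: each word is described by a set of sememes, weak sememes are masked out, and words that share the remaining sememes fall into the same field. The sememe sets here are invented rather than taken from HowNet, whose real structure is far richer.

```python
# Toy illustration of sememe masking and field membership; the sememe
# dictionary is invented, not HowNet data.
SEMEME_OF = {
    "doctor":  {"human", "occupation", "medical"},
    "nurse":   {"human", "occupation", "medical"},
    "teacher": {"human", "occupation", "education"},
    "fever":   {"state", "medical"},
}
MASKED = {"human"}  # sememes with weak descriptive power are screened out

def sememes(word):
    return SEMEME_OF[word] - MASKED

def similarity(w1, w2):
    """Jaccard overlap of unmasked sememe sets."""
    a, b = sememes(w1), sememes(w2)
    return len(a & b) / len(a | b)

print(similarity("doctor", "nurse"))    # 1.0  -> same semantic field
print(similarity("doctor", "teacher"))  # 0.33 -> partial overlap
print(similarity("teacher", "fever"))   # 0.0  -> different fields
```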

7.
贾君枝, 董刚. 《情报科学》, 2007, 25(11): 1682-1686
FrameNet, WordNet, and VerbNet are semantic lexicons widely used in natural language processing and electronic dictionary compilation. The three lexicons express lexical concepts and semantic relations from different angles, complement one another, and have mappings established between them, together providing a rich knowledge resource for semantic analysis. However, they are built on different theoretical foundations and therefore show distinct characteristics. This article compares the three lexicons along four dimensions (theoretical foundation, organizational structure, semantic relations, and scope of application) to clarify their respective emphases and differences, helping dictionary users and language-processing practitioners apply them more effectively.
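For readers who want to see one of these resources programmatically, the snippet below queries WordNet through NLTK's corpus interface (FrameNet and VerbNet have analogous nltk.corpus readers); it requires a one-time `nltk.download('wordnet')`.

```python
# A quick look at the semantic relations WordNet encodes, via NLTK.
from nltk.corpus import wordnet as wn

for syn in wn.synsets("car")[:2]:
    print(syn.name(), "-", syn.definition())
    print("  hypernyms:", [h.name() for h in syn.hypernyms()])
    print("  lemmas:   ", [l.name() for l in syn.lemmas()])
```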

8.
Centred on two fundamental problems in text clustering, text representation and similarity computation, this paper classifies and comprehensively surveys the text representation methods and similarity measures proposed in the literature. Text representation models are grouped into the vector space model, language models, the suffix tree model, and ontology-based models; similarity computation methods are grouped into vector-space-based measures, phrase-based measures, and ontology-based measures.
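The baseline pairing the survey discusses, the vector space model plus cosine similarity, fits in a few lines; the documents below are invented.

```python
# Vector space model + cosine similarity, the classic baseline pairing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "text clustering groups similar documents",
    "clustering documents by text similarity",
    "the weather is sunny today",
]
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X))  # 3x3 matrix; docs 0 and 1 score highest
```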

9.
This paper presents an overview of automatic methods for building domain knowledge structures (domain models) from text collections. Applications of domain models have a long history within knowledge engineering and artificial intelligence. In the last couple of decades they have surfaced noticeably as a useful tool within natural language processing, information retrieval and semantic web technology. Inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines, we give an overview of the current research landscape and some techniques and approaches. We will also discuss trade-offs between different approaches and point to some recent trends.

10.
We need to access target information efficiently and to search arbitrary strings in text at high speed. Among key retrieval strategies, binary tries are often used to support fast ordered access. In particular, the Patricia trie (Pat tree) is the fastest of the binary tries because it has the shallowest tree structure. However, the Pat tree requires a great deal of storage space in memory when the registered key set is large, making it expensive to hold the trie in main storage. To solve this problem, we previously proposed a method that compresses a Pat tree into a compact bit stream, called the Compact Patricia trie (CPat tree). The CPat tree needs only a very small amount of memory; however, as the key set grows, search and update times gradually increase. This paper proposes a new CPat tree structure that avoids the high search and update cost on large key sets, together with a method for constructing the new CPat tree dynamically and efficiently. The method divides the bit-string CPat tree at a fixed depth and composes the divided subtrees hierarchically. A construction algorithm guarantees that an update alters only one of the divided subtrees. Experimental results using 120,000 English nouns and 70,000 Japanese nouns show update times more than 40 times faster than the traditional method, while memory consumption increases by only about 35%.
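A hand-built didactic sketch of the Patricia-trie search that makes Pat trees shallow: internal nodes test only distinguishing bit positions, so a search skips the rest and must verify the full key at the leaf. This illustrates the underlying structure only, not the paper's compact bit-stream (CPat) encoding or its hierarchical division.

```python
# Patricia-trie toy over 4-bit keys: internal nodes store only the index of
# the bit that distinguishes the subtrees, so the tree stays shallow.
class Node:
    def __init__(self, bit=None, key=None, zero=None, one=None):
        self.bit, self.key = bit, key       # internal: bit index; leaf: key
        self.child = {0: zero, 1: one}

# Keys: 0010, 0110, 1000. Internal nodes test only the distinguishing bits.
leaf = lambda k: Node(key=k)
root = Node(bit=0,
            zero=Node(bit=1, zero=leaf("0010"), one=leaf("0110")),
            one=leaf("1000"))

def search(node, key):
    """Follow only the tested bit positions, then verify at the leaf."""
    while node.key is None:
        node = node.child[int(key[node.bit])]
    return node.key == key

print(search(root, "0110"))  # True
print(search(root, "0111"))  # False -> the final comparison catches it
```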

11.
In this paper we focus on the problem of question ranking in community question answering (cQA) forums in Arabic. We address the task with machine learning algorithms using advanced Arabic text representations. The latter are obtained by applying tree kernels to constituency parse trees combined with textual similarities, including word embeddings. Our two main contributions are: (i) an Arabic language processing pipeline based on UIMA—from segmentation to constituency parsing—built on top of Farasa, a state-of-the-art Arabic language processing toolkit; and (ii) the application of long short-term memory neural networks to identify the best text fragments in questions to be used in our tree-kernel-based ranker. Our thorough experimentation on a recently released cQA dataset shows that the Arabic linguistic processing provided by Farasa produces strong results and that neural networks combined with tree kernels further boost the performance in terms of both efficiency and accuracy. Our approach also enables an implicit comparison between different processing pipelines as our tests on Farasa and Stanford parsers demonstrate.
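To give a flavour of what a tree kernel measures (without implementing the full all-subtrees computation), the sketch below compares two constituency trees by the grammar productions they share, using nltk.Tree; the trees are invented examples.

```python
# Rough intuition behind tree kernels: structural overlap of parse trees.
# Real tree kernels count all common subtrees; here we count only shared
# production rules, which is a much coarser approximation.
from nltk import Tree

t1 = Tree.fromstring("(S (NP (DT the) (NN question)) (VP (VBZ asks)))")
t2 = Tree.fromstring("(S (NP (DT the) (NN answer)) (VP (VBZ helps)))")

def shared_productions(a, b):
    pa, pb = set(map(str, a.productions())), set(map(str, b.productions()))
    return len(pa & pb)

print(shared_productions(t1, t2))  # structural-overlap score
```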

12.
[Purpose/Significance] Intent classification of short texts from social networks using purely statistical natural language processing techniques suffers from sparse features, ambiguous semantics, and insufficient labeled data. To address these problems, this paper proposes a Co-training intent classification method that incorporates psycholinguistic information. [Method/Process] First, to enrich the semantic information, psycholinguistic cues carrying sentiment orientation are fused into the extracted text features to expand the feature dimensions. Second, to cope with the limited labeled data, a semi-supervised ensemble method is used during model training to co-train two machine-learning classifiers, one based on event-content representation and one based on sentiment-event representation. Finally, classification is decided by voting on the product of the two classifiers' confidences. [Result/Conclusion] Experimental results show that a corpus enriched with psycholinguistic information and then co-trained achieves better classification performance.
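A minimal co-training loop in the spirit of the method is sketched below: two classifiers trained on different feature views repeatedly pseudo-label the unlabeled points they are most confident about, and the final decision votes by the product of confidences. The two random views, seed size, and per-round batch are invented for the sketch.

```python
# Minimal co-training over two feature views; data and sizes are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 5))                  # view 1: content features
X2 = rng.normal(size=(200, 5))                  # view 2: sentiment cues
y_true = (X1[:, 0] + X2[:, 0] > 0).astype(int)

y = np.full(200, -1)                            # -1 = unlabeled
y[:20] = y_true[:20]                            # a small labeled seed

for _ in range(5):                              # a few co-training rounds
    mask = y != -1
    c1 = LogisticRegression().fit(X1[mask], y[mask])
    c2 = LogisticRegression().fit(X2[mask], y[mask])
    # Each view pseudo-labels the unlabeled points it is most confident about.
    for clf, X in ((c1, X1), (c2, X2)):
        idx = np.where(y == -1)[0]
        if len(idx) == 0:
            break
        proba = clf.predict_proba(X[idx])
        pick = idx[np.argsort(proba.max(axis=1))[-5:]]
        y[pick] = clf.predict(X[pick])

# Final decision: vote by the product of the two views' confidences.
p = c1.predict_proba(X1) * c2.predict_proba(X2)
print("accuracy:", (p.argmax(axis=1) == y_true).mean())
```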

13.
With the rapid development of remote sensing, remote sensing technology has become an important means of monitoring dynamic changes in land cover and ecology. In view of the complexity of monitoring the mangrove ecology of Dongzhaigang, Hainan Province, China, we propose a semantic understanding method for mangrove remote sensing images that combines a multi-feature kernel sparse classifier with a decision-rule model. First, on the basis of multi-feature extraction, we take the spatial context of the samples into account and introduce a kernel function into the sparse representation classifier, yielding a multi-feature kernel sparse representation classifier that classifies the cover types of mangroves and their surrounding objects. Second, to assess the growth conditions of mangrove areas, we put forward a semantic understanding method based on decision rules, dividing mangrove from non-mangrove areas by combining the classification results of the multi-feature kernel sparse representation classifier. We perform a separability analysis on the features extracted from the spatial and spectral domains, then select the best split attribute by the maximum-information-gain criterion to generate a semantic tree and extract semantic rules. Finally, we apply the decision rules to the semantic understanding of mangrove areas and further divide them into two categories: excellent growth and poor growth. Experimental results show that the proposed method can effectively identify mangrove areas and make decisions about mangrove growth.
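The split-selection step can be illustrated with a plain information-gain computation; the tiny feature table (a hypothetical spectral flag and a hypothetical texture flag against growth labels) is invented, not the paper's features.

```python
# Choosing the best split attribute by maximum information gain; the toy
# feature table is invented for illustration.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Entropy reduction achieved by splitting on a discrete feature."""
    gain = entropy(labels)
    for v in np.unique(feature):
        subset = labels[feature == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

ndvi_high = np.array([1, 1, 1, 0, 0, 0])     # hypothetical spectral feature
texture   = np.array([0, 1, 0, 1, 0, 1])     # hypothetical spatial feature
growth    = np.array([1, 1, 1, 0, 0, 0])     # 1 = good growth, 0 = poor

features = {"ndvi_high": ndvi_high, "texture": texture}
best = max(features, key=lambda n: information_gain(features[n], growth))
print(best)  # ndvi_high gives maximal information gain here
```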

14.
A new method is described to extract significant phrases from the title and abstract of scientific or technical documents. The method is based upon a text structure analysis and uses a relatively small dictionary. The dictionary has been constructed from knowledge about concepts in the field of science or technology together with some lexical knowledge, since significant phrases and their component items may be used with different meanings across fields. A text analysis approach has been applied to select significant phrases as substantial and semantic carriers of the contents of the abstract. The results of the experiment on five sets of documents show that significant phrases are effectively extracted in all cases, and that the number extracted per document and the processing time are fairly satisfactory. The information representation of the document, partly using the method, is discussed in relation to the construction of a document information retrieval system.

15.
Word embeddings, which represent words as numerical vectors in a high-dimensional space, are contextualized by generating a unique vector representation for each sense of a word based on the surrounding words and sentence structure. They are typically generated using deep learning models such as BERT, trained on large amounts of text data using self-supervised learning techniques. The resulting embeddings are highly effective at capturing the nuances of language and have been shown to significantly improve the performance of numerous NLP tasks. Word embeddings represent textual records of human thinking, with all the mental relations that we utilize to produce the succession of sentences that make up texts and discourses. Consequently, the distributed representation of words within embeddings ought to capture the reasoning relations that hold texts together. This paper makes its contribution to the field by proposing a benchmark for the assessment of contextualized word embeddings that probes their capability for true contextualization by inspecting how well they capture resemblance, contrariety, comparability, identity, relations in time and space, causation, analogy, and sense disambiguation. The proposed metrics adopt a triangulation approach, using (1) Hume's reasoning relations, (2) standard analogy, and (3) sense disambiguation. The benchmark has been evaluated against 22 Arabic contextualized embeddings and has proven capable of quantifying their differential performance in terms of these reasoning relations. Results of the evaluation revealed that the target embeddings do take context into account and do reasonably well in sense disambiguation, but are weak in identifying converseness, synonymy, complementarity, and analogy. Results also show that the size of an embedding has diminishing returns, because highly frequent language patterns swamp low-frequency patterns. Furthermore, the results suggest that future research endeavors should be concerned not so much with the quantity of data as with its quality, and should focus more on the representativeness of the data and on model architecture, design, and training.
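A sketch of the kind of probe such a benchmark applies: the same surface word should receive different vectors under different senses. It uses the Hugging Face transformers API with bert-base-uncased (downloaded on first run), and the English "bank" example merely stands in for the Arabic embeddings the paper evaluates.

```python
# Probing contextualization: the vector for "bank" should differ between
# its money sense and its river sense.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    ids = tok(sentence, return_tensors="pt")
    i = tok.convert_ids_to_tokens(ids["input_ids"][0].tolist()).index(word)
    with torch.no_grad():
        return model(**ids).last_hidden_state[0, i]

money  = word_vector("she deposited cash at the bank", "bank")
river  = word_vector("they fished along the river bank", "bank")
money2 = word_vector("the bank approved the loan", "bank")

cos = torch.nn.functional.cosine_similarity
print(cos(money, money2, dim=0))  # same sense: higher similarity
print(cos(money, river, dim=0))   # different sense: lower similarity
```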

16.
This paper describes a state-of-the-art supervised, knowledge-intensive approach to the automatic identification of semantic relations between nominals in English sentences. The system employs a combination of rich and varied sets of new and previously used lexical, syntactic, and semantic features extracted from various knowledge sources such as WordNet and additional annotated corpora. The system ranked first at the third most popular SemEval 2007 Task – Classification of Semantic Relations between Nominals and achieved an F-measure of 72.4% and an accuracy of 76.3%. We also show that some semantic relations are better suited for WordNet-based models than other relations. Additionally, we make a distinction between out-of-context (regular) examples and those that require sentence context for relation identification and show that contextual data are important for the performance of a noun–noun semantic parser. Finally, learning curves show that the task difficulty varies across relations and that our learned WordNet-based representation is highly accurate so the performance results suggest the upper bound on what this representation can do.

17.
Most knowledge accumulated through scientific discoveries in genomics and related biomedical disciplines is buried in the vast amount of biomedical literature. Since understanding gene regulation is fundamental to biomedical research, summarizing all the existing knowledge about a gene based on the literature is highly desirable to help biologists digest it. In this paper, we present a study of methods for automatically generating gene summaries from biomedical literature. Unlike most existing work on automatic text summarization, in which the generated summary is often a list of extracted sentences, we propose to generate a semi-structured summary that consists of sentences covering specific semantic aspects of a gene. Such a semi-structured summary is more appropriate for describing genes and poses special challenges for automatic text summarization. We propose a two-stage approach to generate such a summary for a given gene – first retrieving articles about the gene and then extracting sentences for each specified semantic aspect. We address the issue of gene name variation in the first stage and propose several different methods for sentence extraction in the second stage. We evaluate the proposed methods using a test set of 20 genes. Experimental results show that the proposed methods can generate useful semi-structured gene summaries automatically from biomedical literature, and that they outperform general-purpose summarization methods. Among all the proposed methods for sentence extraction, a probabilistic language modeling approach that models gene context performs the best.
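A toy version of the language-modeling idea for the sentence-extraction stage: build a unigram model from text about one semantic aspect of a gene, then rank candidate sentences by smoothed log-likelihood. The aspect text and candidate sentences are invented.

```python
# Unigram language model with add-one smoothing for aspect-based sentence
# ranking; aspect text and candidates are invented toy data.
import math
from collections import Counter

aspect_text = "protein binds dna and regulates transcription of target genes"
model = Counter(aspect_text.split())
total, vocab = sum(model.values()), len(model)

def tokens(s):
    return s.lower().replace(".", "").split()

def score(sentence):
    """Log-likelihood of the sentence under the aspect model."""
    return sum(math.log((model[w] + 1) / (total + vocab + 1))
               for w in tokens(sentence))

candidates = [
    "The protein binds DNA upstream of target genes.",
    "Samples were stored at minus eighty degrees.",
]
for s in sorted(candidates, key=score, reverse=True):
    print(round(score(s), 2), s)   # aspect-relevant sentence ranks first
```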

18.
19.
The Decomposition Method in Contemporary AI Representation and Its Problems (cited by 1; self-citations: 0, citations by others: 1)
The decomposition method is the inevitable choice for AI representation built on formal systems, and it is also the theoretical root of the various bottlenecks that AI representation currently faces in natural-language semantic understanding. The article argues that the hierarchical structure of the sentence and the demarcation theory of the three planes are the key reasons the decomposition method struggles to achieve paragraph- or discourse-level semantic understanding. Word-based contextual description cannot break through the single-sentence limit; for AI representation to achieve a breakthrough, it must draw on a holistic method of contextual description based on paragraphs or discourse, which is precisely what the decomposition method lacks. The achievements of AI representation over the years show that it is unrealistic to discuss holistic context construction apart from the decomposition method. Holistic context construction should take the decomposition method as its foundation; the two are continuous rather than contradictory, and their organic integration is the key to resolving the bottleneck of the decomposition method in AI representation.

20.
Topic evolution has been described by many approaches, from the macro level to the detail level, by extracting topic dynamics from text in the literature and other media. However, why the evolution happens is less studied. In this paper, we focus on whether and how keyword semantics can invoke or affect topic evolution. We assume that semantic relatedness among keywords can affect topic popularity during the literature surveying and citing process, thereby invoking evolution. This assumption, however, needs to be confirmed by an approach that fully considers the semantic interactions among topics. Traditional topic evolution analyses in scientometrics cannot provide such support because they use limited semantic meaning. To address this problem, we apply Google's Word2Vec, a deep learning language model, to enrich the keywords with more complete semantic information. We further develop the semantic space as an urban geographic space and analyze topic evolution geographically using measures of spatial autocorrelation, as if keywords were the changing land parcels of an evolving city. Keyword citations (a keyword citation is counted each time a paper containing the keyword receives a citation) are used as an indicator of keyword popularity. Using bibliographic datasets from the field of geographical natural hazards, experimental results demonstrate that in some local areas the popularity of keywords affects that of the surrounding keywords, although there is no significant impact on the evolution of all keywords. The spatial autocorrelation analysis identifies interaction patterns (including High-High leading and High-Low suppressing) among keywords in local areas. The approach can be regarded as an analysis framework borrowed from geospatial modeling. Moreover, predictions in local areas are shown to be more accurate when spatial autocorrelation is considered.
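The spatial-autocorrelation measure at the heart of the analysis, global Moran's I, can be computed in a few lines; the keyword coordinates (standing in for 2-D embedding positions) and citation counts below are invented.

```python
# Global Moran's I on toy keyword "locations" and popularity values;
# neighbours within a fixed radius get weight 1, all others 0.
import numpy as np

coords = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], float)
citations = np.array([10.0, 12.0, 11.0, 2.0, 1.0])   # keyword popularity

# Binary spatial weights: 1 if two keywords are within distance 2.
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
W = ((d > 0) & (d < 2)).astype(float)

z = citations - citations.mean()
n, s0 = len(z), W.sum()
moran_i = (n / s0) * (z @ W @ z) / (z @ z)
print(round(moran_i, 3))   # > 0: similar popularity clusters in space
```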
