Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Automatic text summarization attempts to provide an effective solution to today’s unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single- and multi-document summarization. The summarizer benefits from two well-established text semantic representation techniques, Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA), as well as the constantly evolving collective human knowledge in Wikipedia. SRL is used to achieve sentence semantic parsing, and the resulting word tokens are represented as vectors of weighted Wikipedia concepts using the ESA method. The essence of the developed framework is to construct a unique concept graph representation underpinned by semantic role-based multi-node (under sentence level) vertices for summarization. We have empirically evaluated the summarization system using the standard publicly available dataset from the Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all state-of-the-art related comparators in single-document summarization based on the ROUGE-1 and ROUGE-2 measures, while also ranking second in the ROUGE-1 and ROUGE-SU4 scores for multi-document summarization. The testing also demonstrates the scalability of the system, i.e., varying the evaluation data size is shown to have little impact on the summarizer performance, particularly for the single-document summarization task. In a nutshell, the findings demonstrate the power of the role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.

2.
Most knowledge accumulated through scientific discoveries in genomics and related biomedical disciplines is buried in the vast amount of biomedical literature. Since understanding gene regulations is fundamental to biomedical research, summarizing all the existing knowledge about a gene based on literature is highly desirable to help biologists digest the literature. In this paper, we present a study of methods for automatically generating gene summaries from biomedical literature. Unlike most existing work on automatic text summarization, in which the generated summary is often a list of extracted sentences, we propose to generate a semi-structured summary which consists of sentences covering specific semantic aspects of a gene. Such a semi-structured summary is more appropriate for describing genes and poses special challenges for automatic text summarization. We propose a two-stage approach to generate such a summary for a given gene: first retrieving articles about a gene and then extracting sentences for each specified semantic aspect. We address the issue of gene name variation in the first stage and propose several different methods for sentence extraction in the second stage. We evaluate the proposed methods using a test set with 20 genes. Experimental results show that the proposed methods can generate useful semi-structured gene summaries automatically from biomedical literature, and that they outperform general-purpose summarization methods. Among all the proposed methods for sentence extraction, a probabilistic language modeling approach that models gene context performs the best.

3.
Searching the Internet for a certain topic can become a daunting task because users cannot read and comprehend all the resulting texts. Automatic text summarization (ATS) in this case is clearly beneficial because manual summarization is expensive and time-consuming. To enhance ATS for single documents, this paper proposes a novel extractive graph-based framework, “EdgeSumm”, that relies on four proposed algorithms. The first algorithm constructs a new text graph model representation from the input document. The second and third algorithms search the constructed text graph for sentences to be included in the candidate summary. When the resulting candidate summary still exceeds a user-required limit, the fourth algorithm is used to select the most important sentences. EdgeSumm combines a set of extractive ATS methods (namely graph-based, statistical-based, semantic-based, and centrality-based methods) to benefit from their advantages and overcome their individual drawbacks. EdgeSumm is general for any document genre (not limited to a specific domain) and unsupervised, so it does not require any training data. The standard datasets DUC2001 and DUC2002 are used to evaluate EdgeSumm using the widely used automatic evaluation tool Recall-Oriented Understudy for Gisting Evaluation (ROUGE). EdgeSumm gets the highest ROUGE scores on DUC2001. For DUC2002, the evaluation results show that the proposed framework outperforms the state-of-the-art ATS systems by achieving improvements of 1.2% and 4.7% over the highest scores in the literature for the metrics of ROUGE-1 and ROUGE-L respectively. In addition, EdgeSumm achieves very competitive results for the metrics of ROUGE-2 and ROUGE-SU4.
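Several entries in this list report ROUGE-1 and ROUGE-2 scores. As a rough illustration of what those numbers measure, the following is a minimal sketch of ROUGE-N recall against a single reference summary. It assumes lowercase whitespace tokenization and no stemming or stop-word handling, so it is not the official ROUGE toolkit and will not reproduce published scores; the function names `ngrams` and `rouge_n_recall` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall: clipped n-gram overlap / reference n-grams.

    Assumption: plain lowercase whitespace tokenization, one reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipped counts).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat is on the mat"
    print(rouge_n_recall(candidate, reference, n=1))
    print(rouge_n_recall(candidate, reference, n=2))
```

Recall is computed relative to the reference, which is why a longer candidate summary can only help this score; evaluations such as DUC therefore fix the summary length.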

4.
Today, due to a vast amount of textual data, automated extractive text summarization is one of the most common and practical techniques for organizing information. Extractive summarization selects the most appropriate sentences from the text and provides a representative summary. The sentences, as individual textual units, are usually too short for major text processing techniques to perform well. Hence, it seems vital to bridge the gap between short text units and conventional text processing methods. In this study, we propose a semantic method for implementing an extractive multi-document summarizer system by using a combination of statistical, machine learning based, and graph-based methods. It is a language-independent and unsupervised system. The proposed framework learns the semantic representation of words from a set of given documents via the word2vec method. It expands each sentence through an innovative method with the most informative and the least redundant words related to the main topic of the sentence. Sentence expansion implicitly performs word sense disambiguation and tunes the conceptual densities towards the central topic of each sentence. Then, it estimates the importance of sentences by using the graph representation of the documents. To identify the most important topics of the documents, we propose an inventive clustering approach. It autonomously determines the number of clusters and their initial centroids, and clusters sentences accordingly. The system selects the best sentences from appropriate clusters for the final summary with respect to information salience, minimum redundancy, and adequate coverage. A set of extensive experiments on the DUC2002 and DUC2006 datasets was conducted to investigate the proposed scheme. Experimental results showed that the proposed sentence expansion algorithm and clustering approach could considerably enhance the performance of the summarization system. Also, comparative experiments demonstrated that the proposed framework outperforms most of the state-of-the-art summarizer systems and can substantially assist the task of extractive text summarization.

5.
Microblogging platforms such as Twitter are increasingly used for on-line client and market analysis. This motivated the proposal of a new track at the CLEF INEX lab: Tweet Contextualization. The objective of this task was to help a user understand a tweet by providing a short explanatory summary (500 words). This summary should be built automatically using resources like Wikipedia and generated by extracting relevant passages and aggregating them into a coherent summary. After four years of running the task, results show that the best systems combine NLP techniques with more traditional methods. More precisely, the best performing systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, part-of-speech (POS) analysis, anaphora detection, a content diversity measure, and sentence reordering. This paper provides a full summary report on the four-year task. While yearly overviews focused on system results, in this paper we provide a detailed report on the approaches proposed by the participants, which can be considered the state of the art for this task. As an important result of the four-year competition, we also describe the open access resources that have been built and collected. The evaluation measures for automatic summarization designed in DUC or MUC were not appropriate for evaluating tweet contextualization; we explain why and describe in detail the LogSim measure used to evaluate the informativeness of produced contexts or summaries. Finally, we also mention the lessons we learned that are worth considering when designing a task.

6.
Abstractive summarization aims to generate a concise summary covering salient content from single or multiple text documents. Many recent abstractive summarization methods are built on the transformer model to capture long-range dependencies in the input text and achieve parallelization. In the transformer encoder, calculating attention weights is a crucial step for encoding input documents. Input documents usually contain some key phrases conveying salient information, and it is important to encode these phrases completely. However, existing transformer-based summarization work does not consider key phrases in the input when determining attention weights. Consequently, some of the tokens within key phrases receive only small attention weights, which is not conducive to encoding the semantic information of input documents. In this paper, we introduce some prior knowledge of key phrases into the transformer-based summarization model and guide the model to encode key phrases. For the contextual representation of each token in a key phrase, we assume the tokens within the same key phrase make larger contributions than other tokens in the input sequence. Based on this assumption, we propose the Key Phrase Aware Transformer (KPAT), a model with a highlighting mechanism in the encoder that assigns greater attention weights to tokens within key phrases. Specifically, we first extract key phrases from the input document and score the phrases’ importance. Then we build a block diagonal highlighting matrix to indicate these phrases’ importance scores and positions. To combine self-attention weights with key phrases’ importance scores, we design two structures of highlighting attention for each head and the multi-head highlighting attention. Experimental results on two datasets (Multi-News and PubMed) from different summarization tasks and domains show that our KPAT model significantly outperforms advanced summarization baselines. We conduct further experiments to analyze the impact of each part of our model on the summarization performance and verify the effectiveness of our proposed highlighting mechanism.
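The block diagonal highlighting matrix mentioned in entry 6 can be pictured with a small sketch. The abstract does not say how the matrix is built or how it is combined with the attention weights, so everything here is an assumption made for illustration: phrases are given as hypothetical `(start, end, score)` token spans with `end` exclusive, and every cell inside a phrase's square block is simply set to that phrase's score. This is not the KPAT implementation.

```python
def highlighting_matrix(seq_len, phrases):
    """Build a seq_len x seq_len block diagonal matrix from key-phrase spans.

    Assumption: `phrases` is a list of (start, end, score) tuples with
    `end` exclusive and non-overlapping spans; cells outside every phrase
    block stay at 0.0.
    """
    matrix = [[0.0] * seq_len for _ in range(seq_len)]
    for start, end, score in phrases:
        for i in range(start, end):
            for j in range(start, end):
                # Tokens inside the same key phrase highlight each other.
                matrix[i][j] = score
    return matrix


if __name__ == "__main__":
    # Six tokens; tokens 1-2 form one key phrase, tokens 4-5 another.
    for row in highlighting_matrix(6, [(1, 3, 0.5), (4, 6, 0.9)]):
        print(row)
```

Because each phrase only marks its own token positions, the non-zero cells form square blocks along the diagonal, which is what makes the matrix block diagonal.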

7.
The increasing volume of textual information on any topic requires its compression to allow humans to digest it. This implies detecting the most important information and condensing it. These challenges have led to new developments in the area of Natural Language Processing (NLP) and Information Retrieval (IR) such as narrative summarization and evaluation methodologies for narrative extraction. Despite some progress over recent years with several solutions for information extraction and text summarization, the problems of generating consistent narrative summaries and evaluating them are still unresolved. With regard to evaluation, manual assessment is expensive, subjective and not applicable in real time or to large collections. Moreover, it does not provide re-usable benchmarks. Nevertheless, commonly used metrics for summary evaluation still imply substantial human effort since they require a comparison of candidate summaries with a set of reference summaries. The contributions of this paper are three-fold. First, we provide a comprehensive overview of existing metrics for summary evaluation. We discuss several limitations of existing frameworks for summary evaluation. Second, we introduce an automatic framework for the evaluation of metrics that does not require any human annotation. Finally, we evaluate the existing assessment metrics on a Wikipedia data set and a collection of scientific articles using this framework. Our findings show that the majority of existing metrics based on vocabulary overlap are not suitable for assessment based on comparison with a full text and we discuss this outcome.

8.
Traditional machine learning systems lack a pathway for humans to integrate their domain knowledge into the underlying machine learning algorithms. Using such systems in domains where decisions can have serious consequences (e.g. medical decision-making and crime analysis) requires the incorporation of human experts' domain knowledge. The challenge, however, is how to effectively combine domain expert knowledge with machine learning algorithms to develop effective models for better decision making. In crime analysis, the key challenge is to identify plausible linkages in unstructured crime reports for hypothesis formulation. Crime analysts painstakingly perform time-consuming searches of many different structured and unstructured databases to collate these associations without any proper visualization. To tackle these challenges and facilitate crime analysis, in this paper we examine unstructured crime reports through text mining to extract plausible associations. Specifically, we present an associative-questioning-based search model to elicit multi-level associations among crime entities. We coupled this model with partition clustering to develop an interactive, human-assisted knowledge discovery and data mining scheme. The proposed human-centered knowledge discovery and data mining scheme for crime text mining is able to extract plausible associations between crimes, identify crime patterns, group similar crimes, and elicit a co-offender network and suspect list based on spatial-temporal and behavioral similarity. These similarities are quantified by calculating Cosine, Jaccard, and Euclidean distances. Additionally, each suspect is ranked by a similarity score in the plausible suspect list. These associations are then visualized by creating a two-dimensional re-configurable crime cluster space along with a bipartite knowledge graph. The proposed scheme also addresses the grand challenge of integrating effective human interaction with machine learning algorithms through a visualization feedback loop. It allows analysts to feed in their domain knowledge, including choosing similarity functions for identifying associations, dynamic feature selection for interactive clustering of crimes, and assigning weights to each component of the crime pattern to rank suspects for an unsolved crime. We demonstrate the proposed scheme through a case study using the anonymized burglary dataset. The scheme is found to facilitate human reasoning and analytic discourse for intelligence analysis.
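Entry 8 quantifies similarity with Cosine, Jaccard, and Euclidean measures. The sketch below shows the textbook definitions of the three, as a reminder of what each compares. How the paper encodes crime reports as vectors or sets is not stated in the abstract, so the inputs here (numeric feature vectors and sets of attribute strings) are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Hypothetical attribute sets for two burglary reports.
    report_a = {"night", "rear window", "jewellery"}
    report_b = {"night", "front door", "jewellery"}
    print(jaccard_similarity(report_a, report_b))
    print(cosine_similarity([1, 0, 2], [2, 0, 4]))
    print(euclidean_distance([0, 0], [3, 4]))
```

Cosine and Jaccard are similarities (higher means more alike), while Euclidean is a distance (lower means more alike), so a system that lets the analyst switch between them has to rank in the appropriate direction.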

9.
Automatic document summarization using citations is based on summarizing what others explicitly say about the document, by extracting a summary from text around the citations (citances). While this technique works quite well for summarizing the impact of scientific articles, other genres of documents as well as other types of summaries require different approaches. In this paper, we introduce a new family of methods that we developed for legal document summarization to generate catchphrases for legal cases (where catchphrases are a form of legal summary). Our methods use both incoming and outgoing citations, and we show how citances can be combined with other elements of cited and citing documents, including the full text of the target document, and catchphrases of cited and citing cases. On a legal summarization corpus, our methods outperform competitive baselines. The combination of full text sentences and catchphrases from cited and citing cases is particularly successful. We also apply and evaluate the methods on scientific paper summarization, where they perform at the level of state-of-the-art techniques. Our family of citation-based summarization methods is powerful and flexible enough to successfully target a range of different domains and summarization tasks.

10.
A high-quality summary is the target and the challenge of any automatic text summarization system. In this paper, we introduce a different hybrid model for the automatic text summarization problem. We exploit the strengths of different techniques in building our model: we use a diversity-based method to filter similar sentences and select the most diverse ones, a swarm-based method to differentiate between the more important and less important features, and fuzzy logic to flexibly tolerate the risk, uncertainty, ambiguity and imprecision in the text feature weights. The diversity-based method focuses on reducing redundancy, and the other two techniques concentrate on the sentence scoring mechanism. We present the proposed model in two forms. In the first form of the model, diversity measures dominate the behavior of the model. In the second form, the diversity constraint is no longer imposed on the model behavior, which means the diversity-based method works the same as the fuzzy swarm-based method. The results showed that the proposed model in the second form performs better than the first form, the swarm model, the fuzzy swarm method and the benchmark methods. Overall, the results show that a combination of diversity measures, swarm techniques and fuzzy logic can generate a good summary containing the most important parts of the document.

11.
Noise reduction through summarization for Web-page classification   Total citations: 1, self-citations: 0, citations by others: 1
Due to a large variety of noisy information embedded in Web pages, Web-page classification is much more difficult than pure-text classification. In this paper, we propose to improve the Web-page classification performance by removing the noise through summarization techniques. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then put forward a new Web-page summarization algorithm based on Web-page layout and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that the classification algorithms (NB or SVM) augmented by any summarization approach can achieve an improvement of more than 5.0% as compared to pure-text-based classification algorithms. We further introduce an ensemble method to combine the different summarization algorithms. The ensemble summarization method achieves more than 12.0% improvement over pure-text-based methods.

12.
Opinion summarization can facilitate users’ decision-making by mining the salient review information. However, due to the lack of sufficient annotated data, most of the early works are based on extractive methods, which restricts the performance of opinion summarization. In this work, we aim to improve the informativeness of opinion summarization to provide better guidance to users. We consider the setting with only reviews and no corresponding summaries, and propose an aspect-augmented model for unsupervised abstractive opinion summarization, denoted as AsU-OSum. We first employ an aspect-based sentiment analysis system to extract opinion phrases from reviews. Then, we construct a heterogeneous graph consisting of reviews and opinion clusters as nodes, which is used to enhance the Transformer-based encoder–decoder framework. Furthermore, we design a novel cascaded attention mechanism to prompt the decoder to pay more attention to the aspects that are more likely to appear in the summary. During training, we introduce a sentiment accuracy reward that further enhances the learning ability of our model. We conduct comprehensive experiments on the Yelp, Amazon, and Rotten Tomatoes datasets. Automatic evaluation results show that our model is competitive and performs better than the state-of-the-art (SOTA) models on some ROUGE metrics. Human evaluation results further verify that our model can generate more informative summaries and reduce redundancy.

13.
This paper describes the development and testing of a novel Automatic Search Query Enhancement (ASQE) algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for prior knowledge. This algorithm is built upon the concept of iterative states and sub-states, harnessing the power of Wikipedia’s data set and link information to identify and utilise recurring terms to aid term selection and weighting during enhancement. The algorithm is designed to prevent query drift by making callbacks to the user’s original search intent: the original query is persisted between internal states together with the additional selected enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and available data. The proposed algorithm was compared against five existing ASQE algorithms that utilise Wikipedia as the sole data source, showing an average Mean Average Precision (MAP) improvement of 0.273 over the tested existing ASQE algorithms.
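Entry 13 reports its gain in Mean Average Precision (MAP). As a reminder of how that metric is computed, here is a minimal sketch using the standard definition: average precision per query, averaged over queries. It assumes each ranked list contains no duplicate document IDs and that relevance is binary; the document IDs and function names are hypothetical and unrelated to the paper's test collection.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k where a relevant doc appears.

    Assumption: `ranked` has no duplicates; relevant documents that were
    never retrieved contribute zero (division by the full relevant count).
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d2", "d1"], {"d1"}),
    ]
    print(mean_average_precision(runs))
```

Since MAP lies between 0 and 1, an absolute difference such as the 0.273 quoted above is a difference on that scale, not a percentage of a baseline.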

14.
The goal of the research presented here is to identify which aspects of Wikipedia can be exploited to support the process of automatically building Multilingual Domain Modules from textbooks. First, we have defined a representation formalism for Multilingual Domain Modules that is essential for Technology Supported Learning Systems which aim to serve a globalized society. To our knowledge, no attempt has been made at achieving domain models that consider multiple languages. Our approach combines Multilingual Educational Ontologies with Learning Objects in different languages. Wikipedia is a valuable resource to accomplish this purpose. In this scenario, we have developed LiDom Builder, a framework that uses Wikipedia as an additional knowledge base for the automatic generation of Multilingual Domain Modules from textbooks. The framework includes domain-independent term extraction methods to identify which topics of Wikipedia are related to the domain to be learnt, and it also extracts their equivalents in other languages. In order to complete the Educational Ontology, we have defined a method to extract pedagogical relationships from Wikipedia and other general-purpose knowledge bases. From this task, we highlight the extraction of relationships that will allow the sequencing of the topics in Technology Supported Learning Systems. In addition, LiDom Builder takes advantage of the structured contents of Wikipedia to identify text fragments that can be used for educational purposes, classifies them and generates their corresponding Learning Objects. The interlanguage links between topics of Wikipedia are used to create Learning Objects in other languages.

15.
In this paper, we present a topic discovery system that aims to reveal the implicit knowledge present in news streams. This knowledge is expressed as a hierarchy of topics and subtopics, where each topic contains the set of documents that are related to it and a summary extracted from these documents. Summaries built in this way are useful for browsing and selecting topics of interest from the generated hierarchies. Our proposal consists of a new incremental hierarchical clustering algorithm, which combines partitional and agglomerative approaches and takes the main benefits of both. Finally, a new summarization method based on Testor Theory is proposed to build the topic summaries. Experimental results on the TDT2 collection demonstrate its usefulness and effectiveness not only as a topic detection system, but also as a classification and summarization tool.

16.
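Digital documents such as books and journal articles have few textual features, which leads to feature vectors with inaccurate semantic representation and to poor classification performance. To address this problem, this paper proposes a digital document classification method based on feature semantic expansion. The method first uses TF-IDF to obtain core feature words, i.e. words that represent the document text well and have high TF-IDF values. It then uses the HowNet semantic dictionary and the open knowledge base Wikipedia to expand the core feature word set with semantic concepts, building a low-dimensional, semantically rich concept vector space. Finally, it builds classifiers with several algorithms, including MaxEnt and SVM, to classify digital documents automatically. Experimental results show that, compared with traditional short-text classification methods based on feature selection, the method effectively expands the semantics of short-text features and improves classification performance on digital documents.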
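The first step in entry 16, selecting core feature words by TF-IDF value, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: documents are assumed to be already tokenized (Chinese text would need a word segmenter first), TF is the relative term frequency, IDF is the unsmoothed `log(N / df)`, and ties are broken alphabetically. The function name `top_tfidf_terms` is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each tokenized document.

    Assumption: tf = count / doc length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    result = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; alphabetical order breaks ties deterministically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


if __name__ == "__main__":
    docs = [
        ["graph", "summary", "graph"],
        ["summary", "entropy"],
        ["summary", "wiki"],
    ]
    print(top_tfidf_terms(docs, k=1))
```

A term that appears in every document, like "summary" in the toy corpus, gets an IDF of zero and is never selected ahead of a rarer term, which is the property that makes TF-IDF useful for picking distinctive core words from short texts.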

17.
Increases in the amount of text resources available via the Internet have amplified the need for automated document summarization tools. However, further efforts are needed to improve the quality of the summarization tools currently available. The current study proposes Karcı Summarization, a novel methodology for extractive, generic summarization of text documents. Karcı Entropy was used for the first time in a document summarization method within a unique approach. An important feature of the proposed system is that it does not require any kind of information source or training data. At the input stage, a text processing tool known as KUSH (named after its authors: Karcı, Uçkan, Seyyarer, and Hark) is introduced and used to preserve semantic consistency between sentences. The Karcı Entropy-based solution chooses the most effective, generic and most informative sentences within a paragraph or unit of text. The Karcı Summarization approach was tested on the open-access Document Understanding Conference datasets (DUC-2002 and DUC-2004). Its performance was calculated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics. The experimental results showed that the proposed summarizer outperformed all current state-of-the-art methods for 200-word summaries in the metrics of ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-W-1.2. In addition, the proposed summarizer outperformed the nearest competitive summarizers by 6.4% for ROUGE-1 Recall on the DUC-2002 dataset. These results demonstrate that Karcı Summarization is a promising technique and it is therefore expected to attract interest from researchers in the field. Our approach was shown to have a high potential for adoptability. Moreover, the method was assessed as quite insensitive to disorderly and missing texts due to its KUSH text processing module.

18.
A well-known challenge for multi-document summarization (MDS) is that a single best or “gold standard” summary does not exist, i.e. it is often difficult to secure a consensus among reference summaries written by different authors. This motivates us to study what the “important information” is in multiple input documents that guides different authors in writing a summary. In this paper, we propose the notions of macro- and micro-level information. Macro-level information refers to the salient topics shared among different input documents, while micro-level information consists of different sentences that elaborate on or provide complementary details for those salient topics. Experimental studies were conducted to examine the influence of macro- and micro-level information on summarization and its evaluation. Results showed that human subjects relied heavily on macro-level information when writing a summary. The length allowed for summaries is the leading factor that affects summary agreement. Meanwhile, our summarization evaluation approach based on the proposed macro- and micro-structure information also suggested that micro-level information offered complementary details for macro-level information. We believe that both levels of information form the “important information” which affects the modeling and evaluation of automatic summarization systems.

19.
20.
This paper considers the finite-time bipartite consensus problem for linear multiagent systems subject to input saturation under a directed interaction topology. Due to the existence of input saturation, the dynamic performance of linear multiagent systems degrades significantly. To improve the dynamic performance of such systems, a dynamic gain scheduling control approach is proposed to design a dynamic Laplacian-like feedback controller, which can be obtained from the analytical solution of a parametric Lyapunov equation. Suppose that each agent is asymptotically null controllable with bounded control, and that the corresponding interaction topology of the signed directed graph with a spanning tree is structurally balanced. Then the dynamic Laplacian-like feedback control can ensure that the linear multiagent systems achieve finite-time bipartite consensus. The dynamic gain scheduling control improves the bipartite consensus performance of the linear multiagent systems more than the static gain scheduling control does. Finally, two examples are provided to show the effectiveness of the proposed control design method.
