Similar Documents
1.
2.
In this paper, the task of text segmentation is approached from a topic modeling perspective. We investigate the use of two unsupervised topic models, latent Dirichlet allocation (LDA) and multinomial mixture (MM), to segment a text into semantically coherent parts. The proposed topic-model-based approaches consistently outperform a standard baseline method on several datasets. A major benefit of the proposed LDA-based approach is that, along with the segment boundaries, it outputs the topic distribution associated with each segment. This information is of potential use in applications such as segment retrieval and discourse analysis. However, the proposed approaches, especially the LDA-based method, have high computational requirements. Based on an analysis of the dynamic programming (DP) algorithm typically used for segmentation, we suggest a modification to DP that dramatically speeds up the process with no loss in performance. The proposed modification is not specific to topic models; it is applicable to any algorithm that uses DP for text segmentation.
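As an illustration of the DP formulation such approaches share (a minimal sketch only: the paper's LDA-based span scoring and its speedup are not reproduced, and the toy `cohesion` function below is an invented stand-in), the following splits a sentence sequence into k contiguous segments maximizing a pluggable span score:

```python
from functools import lru_cache

def segment(n_sentences, k, span_score):
    """Split sentences 0..n-1 into k contiguous segments maximizing
    the summed span_score(i, j) over segments [i, j)."""
    @lru_cache(maxsize=None)
    def best(start, parts):
        if parts == 1:
            return span_score(start, n_sentences), ()
        options = []
        for cut in range(start + 1, n_sentences - parts + 2):
            tail_score, tail_cuts = best(cut, parts - 1)
            options.append((span_score(start, cut) + tail_score,
                            (cut,) + tail_cuts))
        return max(options)
    return best(0, k)

# Toy cohesion score: reward segments whose sentences share vocabulary.
sents = [{"cats", "pets"}, {"cats", "cute"}, {"stocks", "fall"}, {"stocks", "rise"}]
def cohesion(i, j):
    pooled = set().union(*sents[i:j])
    return sum(len(s & pooled) for s in sents[i:j]) / len(pooled)

print(segment(len(sents), 2, cohesion))   # recovers the boundary at sentence 2
```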

3.
Ethnicity-targeted hate speech has been widely shown to influence on-the-ground inter-ethnic conflict and violence, especially in multi-ethnic societies such as Russia. Ethnicity-targeted hate speech detection in user texts is therefore becoming an important task. However, it faces a number of unresolved problems: the difficulty of reliable mark-up; informal and indirect ways of expressing negativity in user texts (such as irony, false generalization, and attribution of unfavored actions to targeted groups); users' inclination to express opposite attitudes toward different ethnic groups in the same text; and, finally, a lack of research on languages other than English. In this work we address several of these problems in the task of ethnicity-targeted hate speech detection in Russian-language social media texts. Our approach allows us to differentiate between attitudes towards different ethnic groups mentioned in the same text, a task that has not been addressed before. We use a dataset of over 2.6M user messages mentioning ethnic groups to construct a representative sample of 12K (ethnic group, text) instances, which are then thoroughly annotated via a special procedure. In contrast to many previous collections, which usually comprise extreme cases of toxic speech, the representativeness of our sample secures a realistic and therefore much higher proportion of subtle negativity, which additionally complicates its automatic detection. We then experiment with four types of machine learning models, from traditional classifiers such as SVM to deep learning approaches, notably the recently introduced BERT architecture, and interpret their predictions in terms of various linguistic phenomena. In addition to hate speech detection with a text-level two-class approach (hate, no hate), we also justify and implement a unique instance-based three-class approach (positive, neutral, negative attitude, the latter implying hate speech). Our best results are achieved with fine-tuned, pre-trained RuBERT combined with linguistic features: F1-hate = 0.760 and F1-macro = 0.833 on the text-level two-class problem, comparable to previous studies, and F1-hate = 0.813 and F1-macro = 0.824 on our instance-based three-class hate speech detection task. Finally, error analysis reveals that further improvement could be achieved by handling complex and creative language more accurately, i.e., by detecting irony and unconventional forms of obscene lexicon.
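The instance-based setup pairs each text with one mentioned group. As a hedged illustration (a TF-IDF + linear SVM baseline standing in for the paper's RuBERT models; the `mark_target` helper, the toy data, and the fictional group name are invented for the example), one crude way to make a classifier target-aware is to mark the group mention inline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def mark_target(text, group):
    # Make the classifier aware of which mention the label refers to.
    return text.replace(group, f"[TARGET] {group} [/TARGET]")

train_pairs = [  # (text, target group, attitude) -- toy, invented examples
    ("I really admire Ruritanians, great neighbours", "Ruritanians", "positive"),
    ("Met some Ruritanians at the market today", "Ruritanians", "neutral"),
    ("Typical Ruritanians, they always cause trouble", "Ruritanians", "negative"),
]
texts = [mark_target(t, g) for t, g, _ in train_pairs]
labels = [lab for *_, lab in train_pairs]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict([mark_target("Ruritanians are wonderful", "Ruritanians")]))
```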

4.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication, and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as tweets, which often contain language irregularity and noise and do not necessarily carry as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN), combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using the CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying the RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well on clean texts but do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation shows that the proposed DeepParaphrase approach achieves good results on both types of text, making it more robust and generic than existing approaches.
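A minimal PyTorch sketch of the coarse-grained encoder described above (a CNN branch for local n-gram features plus an RNN branch for long-range dependencies); all dimensions, pooling choices, and the cosine comparison are assumptions, not the exact DeepParaphrase configuration:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_filters=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # CNN branch: trigram filters over the embedded sentence.
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        # RNN branch: LSTM over the same embeddings.
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):                     # (batch, seq_len)
        x = self.emb(token_ids)                       # (batch, seq, emb)
        c = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, filters, seq)
        c = c.max(dim=2).values                       # global max-pooling
        _, (h, _) = self.rnn(x)                       # final hidden state
        return torch.cat([c, h[-1]], dim=1)           # joint representation

enc = SentenceEncoder(vocab_size=1000)
reps = enc(torch.randint(0, 1000, (2, 12)))           # two toy sentences
print(torch.cosine_similarity(reps[0], reps[1], dim=0))
```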

5.
With the increasing growth of video data, especially in cyberspace, video captioning, i.e., the representation of video data in the form of natural language, has been receiving increasing interest in applications such as video retrieval, action recognition, and video understanding, to name a few. In recent years, deep neural networks have been successfully applied to the task of video captioning. However, most existing methods describe a video clip using only one sentence, which may not correctly cover its semantic content. In this paper, a new multi-sentence video captioning algorithm is proposed using a content-oriented beam search approach and a multi-stage refining method. We use a new content-oriented beam search algorithm to update the probabilities of words generated by the trained deep networks. The proposed beam search algorithm leverages the high-level semantic information of an input video using an object detector and a structural dictionary of sentences. We also use a multi-stage refining approach to remove structurally incorrect sentences as well as sentences that are less related to the semantic content of the video. To this end, a new two-branch deep neural network is proposed to measure the relevance score between a sentence and a video. We evaluated the proposed method on two popular video captioning databases and compared the results with those of several state-of-the-art approaches. The experiments showed the superior performance of the proposed algorithm; for instance, on the MSVD database, the proposed method shows an enhancement of 6% for the best-1 sentences in comparison to the best state-of-the-art alternative.
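A sketch of the reweighting idea behind content-oriented beam search: boost the decoder's probability for words matching objects the detector found in the video. The `step_probs` interface, the fixed toy distribution, and the multiplicative boost rule are stand-ins; the paper's exact update is not reproduced here.

```python
import math

def beam_search(step_probs, detected_objects, beam=3, length=5, boost=3.0):
    """step_probs(prefix) -> {word: prob} for the next word."""
    beams = [([], 0.0)]                                 # (words, log-score)
    for _ in range(length):
        expanded = []
        for words, score in beams:
            for w, p in step_probs(words).items():
                if w in detected_objects:               # content-oriented boost
                    p = min(1.0, boost * p)
                expanded.append((words + [w], score + math.log(p)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam]
    return beams[0][0]

# Toy next-word model: a fixed distribution regardless of the prefix.
lm = {"a": 0.5, "person": 0.3, "dog": 0.2}
# The boost lifts "dog" (a detected object) above "a" in the ranking.
print(beam_search(lambda prefix: lm, detected_objects={"dog"}, beam=2, length=3))
```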

6.
A bottom-up approach to sentence ordering for multi-document summarization
Ordering information is a difficult but important task for applications that generate natural language texts, such as multi-document summarization, question answering, and concept-to-text generation. In multi-document summarization, information is selected from a set of source documents; however, improper ordering of information in a summary can confuse the reader and degrade the summary's readability. It is therefore vital to order the information properly. We present a bottom-up approach to arranging sentences extracted for multi-document summarization. To capture the association and order of two textual segments (e.g., sentences), we define four criteria: chronology, topical-closeness, precedence, and succession. These criteria are integrated into a single criterion by a supervised learning approach. We repeatedly concatenate two textual segments into one segment based on this criterion, until we obtain an overall segment with all sentences arranged. We evaluate the sentence orderings produced by the proposed method and numerous baselines using subjective gradings as well as automatic evaluation measures. We introduce average continuity, an automatic evaluation measure of sentence ordering in a summary, and investigate its appropriateness for this task.
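A minimal sketch of the bottom-up loop: repeatedly join the ordered pair of segments with the strongest association until one fully ordered segment remains. The `assoc` function here is a toy chronology criterion, standing in for the paper's learned combination of chronology, topical-closeness, precedence, and succession.

```python
def bottom_up_order(sentences, assoc):
    segments = [[s] for s in sentences]
    while len(segments) > 1:
        # Pick the ordered pair (a before b) with the highest association.
        i, j = max(((i, j) for i in range(len(segments))
                           for j in range(len(segments)) if i != j),
                   key=lambda ij: assoc(segments[ij[0]], segments[ij[1]]))
        merged = segments[i] + segments[j]
        segments = [s for k, s in enumerate(segments) if k not in (i, j)]
        segments.append(merged)
    return segments[0]

# Toy chronology: score 1 when segment b starts right after segment a ends.
times = {"intro": 0, "body": 1, "end": 2}
chron = lambda a, b: 1.0 if times[b[0]] == times[a[-1]] + 1 else 0.0
print(bottom_up_order(["end", "intro", "body"], chron))  # ['intro','body','end']
```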

7.
With the explosion of multilingual content on the Web, particularly on social media platforms, identifying the languages present in a text is becoming an important task for various applications. While automatic language identification (ALI) in social media text is already non-trivial due to slang words, misspellings, creative spellings, and special elements such as hashtags and user mentions, ALI in a multilingual environment is an even more challenging task. In a highly multilingual society, code-mixing without affecting the underlying language sense has become a natural phenomenon. In such a dynamic environment, the conversational text alone often fails to identify the underlying languages present in it. This paper proposes various methods of exploiting social conversational features to enhance ALI performance. Although social conversational features for ALI have been explored previously using methods such as probabilistic language modeling, these models often fail to address issues related to code-mixing, phonetic typing, and out-of-vocabulary words, which are prevalent in a highly multilingual environment. This paper differs in the way the social conversational features are used, proposing text refinement strategies suitable for ALI in a highly multilingual environment. The contributions of this paper are therefore threefold. First, it analyzes the characteristics of various social conversational features by exploiting language usage patterns. Second, it proposes various text refinement methods suitable for language identification. Third, it investigates the effects of the proposed refinement methods using various sentence-level language identification frameworks. Experimental observations over three conversational datasets collected from the Facebook, YouTube, and Twitter social media platforms show that the proposed method of ALI using social conversational features outperforms the baseline counterparts.
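As a hedged illustration of a sentence-level ALI baseline (character n-grams with naive Bayes; the toy code-mixed samples and label set are invented, and the paper's conversational-feature refinements are not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

samples = [("the meeting is tomorrow", "en"),
           ("kal milte hain dost", "hi-rom"),
           ("see you kal at the office", "mixed")]   # toy code-mixed data
texts, labels = zip(*samples)

# Character n-grams are robust to misspellings and creative spellings.
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["milte hain tomorrow"]))
```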

8.
Cross-genre author profiling aims to build generalized models for predicting profile traits of authors that are useful across different text genres, with applications in computer forensics, marketing, and elsewhere. The cross-genre author profiling task becomes challenging when dealing with low-resourced languages, due to the lack of standard corpora and methods, and even more so when the data is code-switched, i.e., informal and unstructured. In previous studies, cross-genre author profiling has mainly been explored for monolingual texts in highly resourced languages (English, Spanish, etc.). However, it has not been thoroughly explored for code-switched text, which is widely used for communication over social media. To fill this gap, we propose a transfer learning-based solution for the cross-genre author profiling task on code-switched (English–RomanUrdu) text using three widely known genres: Facebook comments/posts, tweets, and SMS messages. In this article, we first experiment with traditional machine learning, deep learning, and pre-trained transfer learning models (MBERT, XLMRoBERTa, ULMFiT, and XLNET) for the same-genre and cross-genre gender identification task. We then propose a novel Trans-Switch approach that focuses on the code-switching nature of the text and trains specialized language models. In addition, we developed three RomanUrdu-to-English translated corpora to study the impact of translation on author profiling tasks. The results show that the proposed Trans-Switch model outperforms the baseline deep learning and pre-trained transfer learning models for cross-genre author profiling on code-switched text. Further, the experiments show that translating the RomanUrdu text does not improve results.
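A hedged sketch of the cross-genre transfer setup with one of the baseline models (MBERT via Hugging Face Transformers); the toy batch, label convention, and single-step update are illustrative, and the Trans-Switch specialization itself is not reproduced:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)   # binary gender labels

# Fine-tune on genre A (e.g., Facebook posts), later evaluate on genre B.
batch = tok(["aaj match dekha? what a great game yaar"],  # toy code-switched text
            return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()   # one illustrative gradient step of the fine-tuning loop
```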

9.
Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR between European and Oriental languages is still at an initial stage. For crossing the language boundary, the corpus-based approach is promising because it overcomes the limitations of the knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between a European and an Oriental language is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.
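A minimal length-based scoring sketch in the Gale–Church spirit: assume the character lengths of translated titles are roughly proportional and penalize candidate pairs by their deviation from that ratio. The `ratio` and `var` values below are invented for illustration, not calibrated on the paper's corpus.

```python
import math

def length_score(en_title, zh_title, ratio=4.5, var=1.0):
    """Higher is better; ratio = expected EN chars per ZH char."""
    delta = (len(en_title) - ratio * len(zh_title)) / math.sqrt(var * len(zh_title))
    return -abs(delta)

zh = "跨语言信息检索"
candidates = ["Cross-lingual information retrieval", "A survey of data mining"]
print(max(candidates, key=lambda en: length_score(en, zh)))  # picks the true pair
```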

10.
Paraphrase detection is an important task in text analytics with numerous applications, such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases; these models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy treats paraphrases and non-paraphrases as binary relations over the set of texts and uses graph-theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks, with and without soft attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces comparable or state-of-the-art performance on all three.
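A sketch of the graph-theoretic augmentation idea: treat labeled paraphrase pairs as edges, take connected components as paraphrase clusters, emit every within-cluster pair as a new positive, and every pair straddling a labeled non-paraphrase edge as a new negative. This is a minimal reading of the strategy; the paper's exact soundness conditions are not reproduced, and the toy labels assume every text appears in at least one positive pair.

```python
from itertools import combinations
import networkx as nx

positives = [("t1", "t2"), ("t2", "t3"), ("t4", "t5")]   # labeled paraphrases
negatives = [("t1", "t4")]                               # labeled non-paraphrases

G = nx.Graph(positives)
clusters = [sorted(c) for c in nx.connected_components(G)]

# Within a cluster, every pair is a paraphrase (transitivity).
aug_pos = [p for c in clusters for p in combinations(c, 2)]

# A non-paraphrase edge between clusters makes every cross pair a negative.
cluster_of = {t: i for i, c in enumerate(clusters) for t in c}
aug_neg = [(x, y) for a, b in negatives
                  for x in clusters[cluster_of[a]]
                  for y in clusters[cluster_of[b]]]
print(aug_pos)   # 4 positives from 3 labeled pairs
print(aug_neg)   # 6 negatives from 1 labeled pair
```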

11.
This paper proposes a learning approach for the merging process in multilingual information retrieval (MLIR). To support the learning approach, we present a number of features that may influence the MLIR merging process, extracted mainly at three levels: query, document, and translation. After feature extraction, we use the FRank ranking algorithm to construct a merge model. To the best of our knowledge, this is the first attempt to use a learning-based ranking algorithm to construct a merge model for MLIR merging. In our experiments, three test collections for the cross-lingual information retrieval (CLIR) task in NTCIR3, 4, and 5 are employed to assess the performance of the proposed method. Several merging methods are also run for comparison, including traditional merging methods, the 2-step merging strategy, and a merging method based on logistic regression. The experimental results show that the proposed method significantly improves merging quality on two different types of datasets. In addition to being effective, the merge model generated by FRank can identify the key factors that influence the merging process, providing more insight into and understanding of MLIR merging.
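A sketch of learning-based merging: represent each (query, document) candidate from the per-language result lists by query-, document-, and translation-level features, train a ranker on labeled relevance, and merge by predicted score. Here sklearn's gradient boosting stands in for FRank, and the three toy features are invented placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy features: [normalized per-list score, rank position, translation prob.]
X = np.array([[0.9, 1, 0.8], [0.7, 2, 0.8], [0.8, 1, 0.5], [0.3, 5, 0.5]])
y = np.array([1, 1, 0, 0])         # relevance labels from training queries

ranker = GradientBoostingRegressor().fit(X, y)

# Merge candidates from different language lists by the learned score.
candidates = {"doc_en": [0.85, 1, 0.9], "doc_zh": [0.80, 1, 0.6]}
merged = sorted(candidates,
                key=lambda d: ranker.predict([candidates[d]])[0], reverse=True)
print(merged)
```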

12.
Nonlinear system identification and prediction is a complex task, and often non-parametric models such as neural networks are used in place of intricate mathematics. To that end, an improved approach to nonlinear system identification using neural networks was recently presented in Gupta and Sinha (J. Franklin Inst. 336 (1999) 721), where a learning algorithm was proposed in which both the slope of the activation function at a neuron, β, and the learning rate, η, were made adaptive. The proposed algorithm assumes that η and β are independent variables. Here, we show that the slope and the learning rate are not independent in a general dynamical neural network, and this should be taken into account when designing a learning algorithm. Further, relationships between η and β are developed, which helps reduce the number of degrees of freedom and the computational complexity of the optimisation task of training a fully adaptive neural network. Simulation results based on Gupta and Sinha (1999) and the proposed approach support the analysis.
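A standard illustration of this dependence (a textbook gain–weight equivalence, not the paper's own derivation): for a neuron output f(βwᵀx), setting u = βw folds the slope into the weights, and gradient descent on w with rate η then corresponds to descent on u with rate β²η, so η and β cannot be tuned independently.

```latex
\Delta w = -\eta\,\frac{\partial E}{\partial w}
         = -\eta\beta\,\frac{\partial E}{\partial u}
\qquad\Longrightarrow\qquad
\Delta u = \beta\,\Delta w = -\eta\beta^{2}\,\frac{\partial E}{\partial u}
```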

13.
Optical Character Recognition (OCR) errors in text harm information retrieval performance, and much research has been reported on modelling and correcting them. Most prior work employs language-dependent resources or training texts to study the nature of the errors; however, little research focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval from an erroneous corpus when no training samples are available to model the errors. In this paper we propose a method that automatically identifies erroneous term variants in the noisy corpus, which are then used for query expansion, in the absence of clean text. We employ an effective combination of contextual information and string matching techniques. The approach automatically identifies the erroneous variants of query terms and consequently improves retrieval performance through query expansion. It uses no training data and no language-specific resources such as a thesaurus for identifying error variants, and it assumes nothing about the language except that the word delimiter is blank space. We have tested our approach on erroneous Bangla (Bengali) and Hindi FIRE collections, as well as on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora. The proposed approach achieves statistically significant improvements over state-of-the-art baselines on most of the datasets.
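A sketch of the string-matching half of such an approach: collect corpus terms within a small edit distance of each query term as candidate OCR variants and add them to the query. The contextual filtering the paper combines with this is omitted, and the vocabulary and distance threshold below are toy values.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def expand_query(query_terms, corpus_vocab, max_dist=2):
    expanded = set(query_terms)
    for q in query_terms:
        expanded |= {t for t in corpus_vocab if edit_distance(q, t) <= max_dist}
    return expanded

vocab = {"government", "govemment", "gov3rnment", "market"}
print(expand_query(["government"], vocab))   # both OCR variants are picked up
```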

14.
In this era, the proliferating role of social media in our lives has popularized the posting of short texts. Short texts contain limited context and have unique characteristics that make them difficult to handle. Every day, billions of short texts are produced in the form of tags, keywords, tweets, phone messages, messenger conversations, social network posts, etc. The analysis of these short texts is imperative in the fields of text mining and content analysis, and the extraction of precise topics from large-scale short text documents is a critical and challenging task. Conventional approaches fail to capture word co-occurrence patterns in topics due to the sparsity problem in short texts, such as text over the web, social media like Twitter, and news headlines. In this paper, the sparsity problem is therefore ameliorated by a novel fuzzy topic modeling (FTM) approach for short text, built from a fuzzy perspective. In this approach, local and global term frequencies are computed through a bag-of-words (BOW) model; to remove the negative impact of high dimensionality on global term weighting, principal component analysis is adopted; and thereafter the fuzzy c-means algorithm is employed to retrieve semantically relevant topics from the documents. Experiments are conducted on three real-world short text datasets: the snippets dataset is small, whereas the Twitter and questions datasets are larger. Experimental results show that the proposed approach discovers topics more precisely and performs better than state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx+LDA. The performance of FTM is also demonstrated in classification, clustering, topic coherence, and execution time. FTM classification accuracy is 0.95, 0.94, 0.91, 0.89, and 0.87 on the snippets dataset with 50, 75, 100, 125, and 200 topics, and 0.73, 0.74, 0.70, 0.68, and 0.78 on the questions dataset with the same numbers of topics; both are higher than the state-of-the-art baseline topic models.
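A compact sketch of the pipeline described above (bag-of-words counts, PCA to tame dimensionality, then fuzzy c-means so each short text receives a soft membership over topics). The c-means loop is a textbook implementation, and all sizes and documents are toy values, not the authors' configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

def fuzzy_cmeans(X, k, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))      # soft memberships
    for _ in range(iters):
        W = U ** m                                   # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-9
        U = 1.0 / d ** (2 / (m - 1))                 # standard FCM update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

docs = ["cheap flights to rome", "hotel deals rome",
        "python list sort", "sort a dict in python"]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)
X = PCA(n_components=2).fit_transform(X)             # tame dimensionality
U, _ = fuzzy_cmeans(X, k=2)
print(U.round(2))                                    # soft topic memberships
```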

15.
Real-world datasets often present various data quality problems, such as outliers, missing values, inaccurate representations, and duplicate entities. To identify duplicate entities, a task named Entity Resolution (ER), a variety of classification techniques may be employed. Rule-based classification techniques have gained increasing attention in the state of the art due to the possibility of incorporating automatic learning approaches for generating Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) generating high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) RbER algorithm parameters cannot be tuned; iii) user preferences regarding the ER results cannot be incorporated into the algorithm's functioning; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out using both real-world and synthetic datasets demonstrate the ability of the proposed approach to enhance the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not entail a significant overhead over the classification step, and we conclude that the Rule Assembler parameters APA, WPA, TβM, and Max are the most suitable for practical scenarios.
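A minimal sketch of the score-assembling idea: each matching rule votes with a confidence, votes are combined with tunable weights, and a threshold (a user preference) decides the match. The rules, weights, and threshold below are illustrative, not the paper's APA/WPA/TβM/Max settings.

```python
def assemble(pair, rules, weights, threshold=0.5):
    """Weighted-average rule confidences; return (is_duplicate, confidence)."""
    scores = [w * rule(pair) for rule, w in zip(rules, weights)]
    confidence = sum(scores) / sum(weights)
    return confidence >= threshold, confidence

# Toy rules over (name_a, name_b) record pairs.
rules = [
    lambda p: 1.0 if p[0].lower() == p[1].lower() else 0.0,  # exact match
    lambda p: len(set(p[0].split()) & set(p[1].split()))     # token overlap
              / max(len(p[0].split()), len(p[1].split())),
]
print(assemble(("ACME Corp", "acme corp"), rules, weights=[0.6, 0.4]))
```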

16.
黄向荣 (Huang Xiangrong), 《科教文汇》 (Kejiao Wenhui), 2014, (10): 43–44
The ideal model of the college Chinese course is a trinity of the humanistic, the aesthetic, and the instrumental, in which the humanistic dimension is primary and foundational. Strengthening the cultivation of humanistic literacy in college Chinese education, and thereby fostering well-rounded people with complete personalities, is an important task this course should undertake. Literary works from all ages and cultures shine with the light of humanity; activating the humanistic and aesthetic qualities of classic texts helps students form healthy, noble personalities and sound dispositions.

17.
Coreference resolution of geological entities is an important task in geological information mining. Although existing generic coreference resolution models can handle geological texts, their performance can decline dramatically without sufficient domain knowledge. Owing to the high diversity of geological terminology, coreference is intricately governed by the semantic and expressive structure of geological terms. In this paper, a framework named CorefRoCNN, based on RoBERTa and a convolutional neural network (CNN), is proposed for end-to-end coreference resolution of geological entities. First, the fine-tuned RoBERTa language model transforms words into dynamic vector representations with contextual semantic information. Second, a CNN-based multi-scale structure feature extraction module for geological terms is designed to capture the invariance of geological terms in length, internal structure, and distribution. Third, the structural features and word embeddings are combined to determine coreference relations. In addition, attention mechanisms are used to improve the model's ability to capture valid information in geological texts with long sentences. To validate the effectiveness of the model, we compared it with several state-of-the-art models on the constructed dataset. The results show that our model achieves the best performance, with an average F1 value of 79.78%, a 1.22% improvement over the second-ranked method.
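A minimal PyTorch sketch of multi-scale feature extraction over a term's token embeddings: parallel convolutions of different widths, max-pooled and concatenated into a length-invariant term feature. All sizes are assumptions, not the CorefRoCNN configuration.

```python
import torch
import torch.nn as nn

class MultiScaleTermEncoder(nn.Module):
    def __init__(self, emb_dim=768, n_filters=32, widths=(2, 3, 4)):
        super().__init__()
        # One convolution per scale (bigram, trigram, 4-gram windows).
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w, padding=w // 2) for w in widths)

    def forward(self, term_embs):             # (batch, seq, emb) from RoBERTa
        x = term_embs.transpose(1, 2)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)       # length-invariant term feature

feat = MultiScaleTermEncoder()(torch.randn(2, 6, 768))
print(feat.shape)                             # (2, 96)
```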

18.
This paper focuses on the identification of multiple-input single-output output-error systems with unknown time-delays. Since the time-delays are unknown, an identification model with a high-dimensional and sparse parameter vector is derived based on overparameterization. Traditional identification methods cannot obtain sparse solutions and require a large number of observations unless the time-delays are predetermined. Inspired by sparse optimization and greedy algorithms, an auxiliary model based orthogonal matching pursuit iterative (AM-OMPI) algorithm is proposed using orthogonal matching pursuit; then, based on gradient search, an auxiliary model based gradient pursuit iterative algorithm is proposed that is computationally more efficient than the AM-OMPI algorithm. The proposed methods can simultaneously estimate the parameters and time-delays from a small number of sampled data. A simulation example illustrates the effectiveness of the proposed algorithms.
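A numpy sketch of the underlying idea: run orthogonal matching pursuit on the overparameterized regression y = Φθ, where the nonzero entries of the sparse θ reveal the time-delays. This is plain OMP on toy data, not the auxiliary-model variant (AM-OMPI) itself.

```python
import numpy as np

def omp(Phi, y, sparsity):
    residual, support = y.copy(), []
    for _ in range(sparsity):
        # Greedily pick the column most correlated with the residual.
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        theta_s, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ theta_s     # re-orthogonalize
    theta = np.zeros(Phi.shape[1])
    theta[support] = theta_s
    return theta

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 20))                  # lagged-input regressors
true = np.zeros(20)
true[[3, 11]] = [1.5, -0.8]                          # delays at lags 3 and 11
theta = omp(Phi, Phi @ true + 0.01 * rng.standard_normal(50), sparsity=2)
print(np.nonzero(theta)[0])                          # recovered delay positions
```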

19.
In-depth exploration of the knowledge linkages between science and technology (S&T) is an essential prerequisite for accurately understanding the laws of S&T innovation, promoting the transformation of scientific outcomes, and optimizing S&T innovation policies. A novel deep learning-based methodology is proposed to investigate S&T linkages, with papers and patents used to represent science and technology, respectively. To reveal the linkages between science and technology topics accurately and comprehensively, the proposed framework combines knowledge-structure information with textual semantics. Furthermore, the analysis is conducted from the perspective of achieving an optimal matching between science and technology topics, which enables combinatorial optimization of the S&T knowledge systems. Specifically, science and technology networks are constructed based on Node2Vec and BERT; science and technology topics are then identified using the Fast Unfolding algorithm and the Z-Score index. Finally, a science–technology bipartite graph is constructed, the S&T topic linkage identification task is transformed into a bipartite matching problem, and the maximum-weight matching is found with a Kuhn–Munkres bipartite algorithm. An experiment on natural language processing demonstrates the feasibility and reliability of the proposed methodology.
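A sketch of the final matching step: given a science-topic × technology-topic similarity matrix, recover the maximum-weight one-to-one matching. scipy's Hungarian solver stands in for the Kuhn–Munkres step, and the similarity values are toy numbers, not Node2Vec/BERT outputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

sim = np.array([[0.9, 0.1, 0.3],   # science topic 0 vs tech topics 0..2
                [0.2, 0.8, 0.4],
                [0.3, 0.2, 0.7]])
rows, cols = linear_sum_assignment(sim, maximize=True)
print(list(zip(rows, cols)), sim[rows, cols].sum())  # optimal topic pairs
```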

20.
This paper proposes a hybrid Stochastic Fractal Search (SFS) and Local Unimodal Sampling (LUS) based multistage Proportional Integral Derivative (PID) controller, consisting of a Proportional Derivative controller with derivative Filter (PDF) plus a (1 + Proportional Integral) stage, for Automatic Generation Control (AGC) of power systems. Initially, a single-area multi-source power system consisting of thermal, hydro, and gas power plants is considered, and the parameters of an Integral (I) controller are optimized by the SFS algorithm. The superiority of the SFS algorithm over some recently proposed approaches, such as optimal control, Differential Evolution (DE), and Teaching Learning Based Optimization (TLBO), is demonstrated. To improve system performance further, LUS is subsequently employed. The study is then extended to other controllers, namely PID and the proposed multistage PID controller, and the superiority of the multistage PID controller over the conventional PID structure is demonstrated. The study is further extended to a two-area, six-unit, multi-source interconnected power system, where the superiority of the proposed approach over TLBO and optimal control is demonstrated. Finally, the study is extended to a three-unequal-area power system with appropriate nonlinearities, namely Generation Rate Constraint (GRC), Governor Dead Band (GDB), and time delay. The analysis shows that the hybrid SFS–LUS algorithm is superior to the original SFS algorithm, and that substantial improvement in system performance is realized with the proposed multistage PID controller over the conventional PID structure.
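A toy illustration of the tuning loop behind such approaches: simulate the step response of a first-order plant under a plain PID controller and refine the gains by LUS-style shrinking random search on an ITAE cost. This is a minimal, invented stand-in for the hybrid SFS–LUS procedure and the AGC models, not a reimplementation of either.

```python
import numpy as np

def itae(gains, dt=0.01, T=5.0, tau=1.0):
    """ITAE cost of a unit-step response of dy/dt = (u - y)/tau under PID."""
    kp, ki, kd = gains
    y = integ = prev_e = cost = 0.0
    for k in range(int(T / dt)):
        e = 1.0 - y                              # unit step reference
        integ += e * dt
        u = kp * e + ki * integ + kd * (e - prev_e) / dt
        y += dt * (u - y) / tau                  # explicit Euler plant step
        prev_e = e
        cost += (k * dt) * abs(e) * dt           # integral of time * |error|
    return cost

rng = np.random.default_rng(1)
best = np.array([1.0, 0.5, 0.05])
best_cost, radius = itae(best), 1.0
for _ in range(200):                             # LUS-style shrinking search
    cand = np.clip(best + rng.uniform(-radius, radius, 3), 0, None)
    c = itae(cand)
    if c < best_cost:
        best, best_cost = cand, c
    else:
        radius *= 0.97                           # shrink radius on failure
print(best.round(3), round(best_cost, 4))
```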
