Similar Documents
20 similar documents found.
1.
Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be seen as a single-label multi-class text categorization problem. Very often there are extremely few training texts for at least some of the candidate authors, or there is significant variation in text length among the available training texts of the candidate authors. Moreover, in this task there is usually no similarity between the distributions of training and test texts over the classes; that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short samples and majority classes into fewer, longer samples. We explore text sampling methods in order to construct a training set with a desirable distribution over the classes. Essentially, text sampling provides new synthetic data that artificially increase the training size of a class. Based on two text corpora in two languages, namely newswire stories in English and newspaper reportage in Arabic, we present a series of authorship identification experiments on various multi-class imbalanced cases that reveal the properties of the presented methods.
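As a rough illustration of the text-sampling idea described above, the following Python sketch splits each author's training material into a fixed number of segments, so minority classes yield many short samples and majority classes fewer, longer ones. The function name, the word-level splitting, and the samples-per-class parameter are illustrative assumptions, not the paper's exact procedure.

```python
# A minimal sketch of class-size-aware text sampling: every class is split into
# the same number of training samples, balancing the resulting training set.
from typing import Dict, List

def resample_by_class(texts_per_class: Dict[str, str],
                      samples_per_class: int = 20) -> Dict[str, List[str]]:
    """Split each author's concatenated training text into a fixed number of segments."""
    balanced = {}
    for author, full_text in texts_per_class.items():
        words = full_text.split()
        seg_len = max(1, len(words) // samples_per_class)
        segments = [" ".join(words[i:i + seg_len])
                    for i in range(0, len(words), seg_len)]
        balanced[author] = segments[:samples_per_class]
    return balanced
```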

2.
A large amount of knowledge is now produced on the Internet, and there is no uniform way of constructing knowledge graphs from it. Recent disease outbreaks have also led the community to place more importance on the healthcare system. Diabetes is a severe disease that affects people's health. To assist the health sector in combating this deadly disease, the authors developed a deep learning strategy for diabetes named entity extraction that fuses textual features with relation extraction, taking text data as the object of study. This study aims to develop a multi-feature entity recognition model that accounts for the differences in text features across fields. First, in the word embedding layer, a multi-feature word embedding algorithm is proposed that integrates Pinyin, radicals, and the meaning of the character itself, so that the embedding vector carries the characteristics of Chinese characters and of diabetes text. Then, in the model, a CNN and a BiLSTM are used to extract local and global features of the text sequence, respectively, addressing the inability of traditional methods to capture dependencies before and after positions in the sequence. Finally, a CRF outputs the predicted tag sequence. The experimental results show that the multi-feature embedding algorithm and the local features extracted by the CNN effectively improve the recognition performance of the entity recognition model.
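A minimal PyTorch sketch of the kind of architecture described (not the authors' code): character, Pinyin, and radical embeddings are concatenated, passed through a CNN for local features and a BiLSTM for global context, and projected to per-token tag scores. The CRF decoding layer is omitted; attaching a package such as pytorch-crf on top of the emissions is an assumption, not taken from the abstract.

```python
import torch
import torch.nn as nn

class MultiFeatureNER(nn.Module):
    def __init__(self, n_chars, n_pinyin, n_radicals, n_tags,
                 emb_dim=64, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb_dim)
        self.pinyin_emb = nn.Embedding(n_pinyin, emb_dim)
        self.radical_emb = nn.Embedding(n_radicals, emb_dim)
        fused = emb_dim * 3
        # 1-D convolution over the sequence captures local n-gram features.
        self.cnn = nn.Conv1d(fused, hidden, kernel_size=3, padding=1)
        # BiLSTM captures dependencies before and after each position.
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(hidden, n_tags)

    def forward(self, chars, pinyin, radicals):
        x = torch.cat([self.char_emb(chars),
                       self.pinyin_emb(pinyin),
                       self.radical_emb(radicals)], dim=-1)
        x = torch.relu(self.cnn(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.bilstm(x)
        return self.emissions(x)  # per-token tag scores (input to a CRF layer)
```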

3.
[Purpose/Significance] The structural functions of academic papers are a concentrated expression of their discourse structure and semantic content. Current research on the structural functions of academic papers focuses mainly on recognition at different levels of the paper and on the applicability of model algorithms from the perspective of disciplinary differences; comparative studies of the internal relations among models, disciplines, and levels are lacking. [Method/Process] Academic papers published in authoritative Chinese journals in traditional Chinese medicine, library and information science, computer science, environmental science, and botany were selected as the experimental corpus. Building on deep learning models such as CNN, LSTM, and BERT, structural functions were recognized at the sentence, paragraph, and section-content levels. [Result/Conclusion] The experimental results show that BERT achieves the best recognition of structural functions across disciplines and across levels of the papers; all models perform best at the section-content level; and papers in traditional Chinese medicine are recognized better than those in the other disciplines. In addition, confusion matrices are used to present the specific cases of misrecognition across disciplines, and the causes of misrecognition are analyzed. [Innovation/Limitation] This study provides first-hand empirical material for research on recognizing the structural functions of academic papers.
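A minimal sketch of classifying a paragraph's structural function with a fine-tuned BERT model, using the Hugging Face transformers library. The checkpoint name, the five-way label set, and the function name are illustrative assumptions, not the study's actual setup, and the model would need fine-tuning on the labeled corpus before use.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["introduction", "method", "result", "discussion", "conclusion"]  # hypothetical

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))  # fine-tune on labeled paragraphs first

def predict_function(paragraph: str) -> str:
    inputs = tokenizer(paragraph, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```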

4.
Although there has been considerable research on Chinese text classification in recent years, studies that use deep learning to automatically classify long Chinese texts such as policy documents remain rare. Drawing on and extending traditional data augmentation methods, this paper proposes NEWT, a new computational framework that integrates the New Era People's Daily segmented corpus (NEPD), the easy data augmentation (EDA) algorithm, word2vec, and a text convolutional neural network (TextCNN). In the empirical part, the algorithm is validated on science and technology policy texts issued by Chinese local governments. The experimental results show that, with input lengths of 500, 750, and 1,000 words, NEWT outperforms traditional deep learning models such as RCNN, Bi-LSTM, and CapsNet in classifying Chinese science and technology policy texts, with an average improvement in F1 of more than 13%. At the same time, NEWT approximates the effect of full-text input at shorter input lengths, partly improving the computational efficiency of traditional deep learning models in the automatic classification of long Chinese texts.
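A minimal sketch of two EDA-style augmentation operations (random swap and random deletion) applied to a pre-segmented Chinese sentence. The full EDA algorithm also uses synonym replacement and random insertion, which require a synonym lexicon and are omitted here; the example sentence is illustrative.

```python
import random

def random_swap(tokens, n=1):
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

# Example: augment one segmented policy sentence into several variants.
sentence = ["加快", "科技", "成果", "转化", "的", "若干", "措施"]
augmented = [random_swap(sentence), random_deletion(sentence)]
```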

5.
6.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication, and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as Twitter, which often contain language irregularity and noise and do not necessarily contain as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN), combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using the CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying the RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation shows that the proposed DeepParaphrase-based approach achieves good results on both types of texts, making it more robust and generic than the existing approaches.
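A minimal sketch of the fine-grained word-level matching idea: a cosine similarity matrix between the word vectors of two sentences, which could feed a downstream classifier. The embeddings here are random stand-ins rather than trained vectors, and the function is illustrative, not the DeepParaphrase implementation.

```python
import numpy as np

def similarity_matrix(sent_a, sent_b, emb):
    A = np.stack([emb[w] for w in sent_a])
    B = np.stack([emb[w] for w in sent_b])
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T  # (len(sent_a), len(sent_b)) word-by-word similarities

vocab = ["the", "cat", "sat", "a", "feline", "rested"]
emb = {w: np.random.randn(50) for w in vocab}  # stand-in vectors
M = similarity_matrix(["the", "cat", "sat"], ["a", "feline", "rested"], emb)
```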

7.
A method is introduced to recognize the part-of-speech in English texts using knowledge of linguistic regularities rather than voluminous dictionaries. The algorithm proceeds in two steps. In the first step, information concerning the part-of-speech is extracted from each word of the text in isolation, using morphological analysis and the fact that English has a reasonable number of word endings characteristic of the part-of-speech. The second step looks at a whole sentence and, using syntactic criteria, assigns the part-of-speech to a single word according to the parts-of-speech and other features of the surrounding words. In particular, the parts-of-speech relevant for automatic indexing of documents, i.e. nouns, adjectives, and verbs, are recognized. An application of this method to a large corpus of scientific text showed that the part-of-speech was identified correctly for 84% of the words and definitely wrongly for only 2%; the remaining words received ambiguous assignments. Using only word lists of limited extent, the technique may thus be a valuable tool aiding automatic indexing of documents and automatic thesaurus construction, as well as other kinds of natural language processing.
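A minimal sketch of the first, word-in-isolation step: guessing a part-of-speech from characteristic English endings. The suffix list is illustrative, not the paper's actual rule set, and the second, sentence-level syntactic disambiguation step is not shown.

```python
SUFFIX_RULES = [
    ("tion", "noun"), ("ment", "noun"), ("ness", "noun"),
    ("ous", "adjective"), ("ful", "adjective"), ("able", "adjective"),
    ("ize", "verb"), ("ise", "verb"), ("ing", "verb"), ("ed", "verb"),
]

def guess_pos(word: str) -> str:
    for suffix, tag in SUFFIX_RULES:
        if word.lower().endswith(suffix):
            return tag
    return "ambiguous"  # left for the syntactic, sentence-level step

print([(w, guess_pos(w)) for w in ["classification", "useful", "indexing", "texts"]])
```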

8.
Surface fragmentation caused by intense human disturbance makes the phenomena of "same object, different spectra" and "different objects, same spectrum" particularly severe, so extracting land-use information from the spectral statistics of ground objects alone has obvious shortcomings. It is therefore necessary to explore land-use information extraction models that couple multiple methods for specific regions, which can provide a convenient path for extracting complex land-use information. This paper first identifies the spectral characteristics of typical land-use types and the variation patterns of normalized-difference indices, then uses Interactive Data Language (IDL) to couple the spectral information and build decision functions, and applies decision-tree classification to TM images from 1993, 2001, and 2009 to automatically extract typical land-use information for the zone of the Three Gorges Reservoir area where backwater inundation overlaps with rapid urbanization, a compound zone of intense human disturbance. The results show that the spectral response curves and normalized-difference index curves of the typical land classes match the characteristics of the three basic land-cover types, verifying the reasonableness of the three selected indices and the validity of the chosen sample areas. The land classes are difficult to separate in TM bands 1, 2, and 3, where spectral trends differ little and are strongly correlated, whereas bands 3, 4, and 5 show the largest spectral differences, contain the richest land-use information, and are better suited to extracting the typical land classes of the sample area. The curves of the three indices and of their difference-combination index agree well with the actual land-class changes in the study area. The decision-tree classification coupling multiple methods achieves higher accuracy than supervised classification for extracting land-use information, and is a low-cost, easily implemented, and generalizable method for automatic land-use classification in the sample area, raising overall classification accuracy for 1993, 2001, and 2009 by 18.46%, 15.43%, and 24.18%, respectively. Accordingly, a decision-tree classification model coupling multiple methods offers a workable way to extract land-use information in the human-disturbance compound zone of the Three Gorges Reservoir area, and an integration strategy for existing methods of automatic land-use information extraction across the whole reservoir area.
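A minimal sketch of the general approach: compute a normalized-difference index from two bands and apply hand-set thresholds as a small decision tree over pixel arrays. The band roles, index choices, and threshold values are illustrative assumptions, not the study's actual decision functions.

```python
import numpy as np

def norm_diff(band_a: np.ndarray, band_b: np.ndarray) -> np.ndarray:
    """Generic normalized-difference index, e.g. (NIR - Red) / (NIR + Red)."""
    return (band_a - band_b) / (band_a + band_b + 1e-6)

def classify(nir: np.ndarray, red: np.ndarray, water_index: np.ndarray) -> np.ndarray:
    veg_index = norm_diff(nir, red)
    out = np.full(veg_index.shape, "built-up", dtype=object)
    out[veg_index > 0.3] = "vegetation"   # threshold chosen purely for illustration
    out[water_index > 0.2] = "water"      # water test applied last, so it takes precedence
    return out
```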

9.
马达  卢嘉蓉  朱侯 《情报科学》2023,41(2):60-68
[Purpose/Significance] This study explores effective deep-learning methods for emotion classification of Weibo texts and examines the relationship between the emotion types of users' reposted comments in trending Weibo events and the spread of private information. [Method/Process] Comparative experiments were set up with four deep learning classification models, BERT, BERT+CNN, BERT+RNN, and ERNIE, and the better-performing model was validated on a newly constructed seven-class emotion corpus. Four trending Weibo cases were selected for empirical analysis in terms of emotion distribution, sentiment-word frequency, reposting time, and reposting count. [Result/Conclusion] The empirical study finds that users spread private information rapidly and briefly; negative emotions, represented by "anger" and "disgust", dominate during the spread, and the emotion types and modes of expression differ with the subject of the private information. [Innovation/Limitation] The study characterizes users' emotions when spreading private information and the relation between the two, providing a valuable theoretical and practical basis for protecting the privacy of social network users; however, the corpus is not large enough to train a highly accurate deep learning model, and the models perform poorly on irony and sarcasm.

10.
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, supervised learning approaches have some problems. The most notable is that they require a large number of labeled training documents for accurate learning. While unlabeled documents are easily collected and plentiful, labeled documents are difficult to generate because the labeling task must be done by human developers. In this paper, we propose a new text classification method based on unsupervised or semi-supervised learning. The proposed method launches text classification tasks with only unlabeled documents and the title word of each category for learning, and then automatically learns a text classifier using bootstrapping and feature projection techniques. The experimental results show that the proposed method achieves reasonably useful performance compared to a supervised method. If the proposed method is used in a text classification task, building text classification systems will become significantly faster and less expensive.
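A minimal sketch of the bootstrapping idea: seed each category with its title word, pseudo-label the unlabeled documents that contain a seed word, train a classifier on those, and relabel the whole collection. The seed words and number of rounds are illustrative, and the paper's feature-projection step is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def bootstrap(docs, seeds, rounds=3):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    # Initial pseudo-labels: a document gets a category if it contains that category's title word.
    labels = [next((c for c, w in seeds.items() if w in d.lower()), None) for d in docs]
    for _ in range(rounds):
        seed_idx = [i for i, y in enumerate(labels) if y is not None]
        clf = MultinomialNB().fit(X[seed_idx], [labels[i] for i in seed_idx])
        labels = list(clf.predict(X))   # relabel everything with the current model
    return labels

seeds = {"sports": "game", "politics": "election"}   # hypothetical title words
```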

11.
Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases. These models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy treats the notions of paraphrase and non-paraphrase as binary relations over the set of texts, and then uses graph-theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks, with and without soft-attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces comparable or state-of-the-art performance on all three.
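A minimal sketch of the graph-theoretic augmentation idea: treat known paraphrase pairs as edges, take the connected components (treating paraphrase as transitive), and emit every within-component pair as an additional positive example. Generation of the extra non-paraphrase pairs is not shown, and the union-find helper is an illustrative implementation choice.

```python
from itertools import combinations

def augment_paraphrases(pairs):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for a, b in pairs:
        union(a, b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    # All unordered pairs inside a component are treated as paraphrases.
    return [p for members in groups.values() for p in combinations(sorted(members), 2)]

extra = augment_paraphrases([("s1", "s2"), ("s2", "s3"), ("s4", "s5")])
# -> all pairs within {s1, s2, s3} plus (s4, s5)
```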

12.
王倩  曾金  刘家伟  戚越 《情报科学》2020,38(3):64-69
[Purpose/Significance] Against the background of academic big data applications, finer-grained and more semantic analysis and mining of academic texts is increasingly urgent, and recognizing the structural functions of academic texts has become a research focus. [Method/Process] This paper recognizes section structure functions at the paragraph level, representing academic text paragraphs with features that combine a convolutional neural network and a recurrent neural network, and then classifying them. [Result/Conclusion] The proposed deep learning method outperforms traditional machine learning methods in overall classification results, while greatly reducing the manual effort of traditional feature engineering.

13.
Natural Language Processing (NLP) techniques have been successfully used to automatically extract information from unstructured text through a detailed analysis of its content, often to satisfy particular information needs. In this paper, an automatic concept map construction technique, Fuzzy Association Concept Mapping (FACM), is proposed for converting abstracted short texts into concept maps. The approach consists of a linguistic module and a recommendation module. The linguistic module is a text mining method that does not require the user to have any prior knowledge of NLP techniques. It incorporates rule-based reasoning (RBR) and case-based reasoning (CBR) for anaphora resolution, and aims to extract the propositions in the text so as to construct a concept map automatically. The recommendation module is based on fuzzy set theory. It is an interactive process that suggests propositions for further human refinement of the automatically generated concept maps. The suggested propositions are relationships among concepts that are not explicitly found in the paragraphs. This technique helps to stimulate individual reflection and generate new knowledge. Evaluation was carried out using the Science Citation Index (SCI) abstract database and CNET News as test data, which are well-known sources whose text quality is assured. Experimental results show that the automatically generated concept maps conform to the outputs generated manually by domain experts, since the degree of difference between them is proportionally small. The method gives users the ability to convert scientific and short texts into a structured format that can be easily processed by computer. Moreover, it provides knowledge workers with extra time to rethink their written text and to view their knowledge from another angle.

14.
We propose a CNN-BiLSTM-Attention classifier for online short messages in Chinese posted by users on government web portals, so that a message can be directed to one or more government offices. Our model leverages all available information to carry out multi-label classification, making use of different hierarchical text features and of label information. In particular, the method extracts label meaning, the CNN layer extracts local semantic features of the texts, the BiLSTM layer fuses the contextual features of the texts with the local semantic features, and the attention layer selects the most relevant features for each label. We evaluate our model on two large public corpora and on our high-quality handcrafted e-government multi-label dataset, which was constructed with the text annotation tool doccano and consists of 29,920 data points. Experimental results show that our proposed method is effective under common multi-label evaluation metrics, achieving micro-F1 scores of 77.22%, 84.42%, and 87.52% and macro-F1 scores of 77.68%, 73.37%, and 83.57% on these three datasets respectively, confirming that our classifier is robust. We conduct an ablation study to evaluate our label embedding method and attention mechanism. Moreover, a case study on our handcrafted e-government multi-label dataset verifies that our model integrates all types of semantic information in short messages across different labels to achieve text classification.
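A minimal PyTorch sketch of label-wise attention of the kind described: each label has its own query vector that attends over the token features produced by an encoder (e.g. a CNN/BiLSTM stack), yielding one pooled representation and one logit per label. Dimensions and names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    def __init__(self, n_labels: int, hidden: int):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(n_labels, hidden))
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_feats):                        # (batch, seq_len, hidden)
        scores = token_feats @ self.label_queries.t()      # (batch, seq_len, n_labels)
        weights = torch.softmax(scores, dim=1)             # attention over tokens, per label
        pooled = torch.einsum("bsl,bsh->blh", weights, token_feats)
        return self.out(pooled).squeeze(-1)                # (batch, n_labels) logits
```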

15.
Automatic text summarization concerns the language industries. This work proposes a system that automatically and directly transforms a source text into a reduced target text. The system deals exclusively with scientific and technical texts. It is based on the identification of specific expressions that allow an evaluation of the relevance of each sentence, which can then be selected for the summary. The procedure consists in attributing a score to each sentence of the text and then eliminating those with the lowest scores. To produce the RAFI system (automatic summary based on indicative fragments), we combined the linguistic means of discourse analysis with the computing capacity of data processing instruments. The system could be adapted to the Internet.
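A minimal sketch of the scoring-and-pruning idea: each sentence receives a score from the indicative expressions it contains, and the lowest-scoring sentences are dropped. The expression list, weights, and keep ratio are illustrative placeholders, not RAFI's actual fragment inventory.

```python
import re

INDICATORS = {"in conclusion": 3, "we propose": 2, "results show": 2, "in this paper": 1}

def summarize(text: str, keep_ratio: float = 0.5) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = [(sum(w for k, w in INDICATORS.items() if k in s.lower()), i, s)
              for i, s in enumerate(sentences)]
    # Keep the highest-scoring sentences, then restore original order.
    keep = sorted(scored, reverse=True)[:max(1, int(len(sentences) * keep_ratio))]
    return " ".join(s for _, _, s in sorted(keep, key=lambda t: t[1]))
```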

16.
To address the problem that watermark information is easily lost in two-dimensional text image watermarking, a three-dimensional text image watermarking model is proposed based on multilayer overlapping of extracted two-dimensional information, and a concrete method is realized by embedding, extracting, and overlapping multiple watermarks in sequence. We discuss the influence of the sequence order, positional distance, and color difference of the extracted two-dimensional information on the display of a three-dimensional text watermark image. In addition, with invisibility and robustness as the crucial evaluation indices for a watermarking algorithm, a method for selecting superior embedding regions for the multiple watermarks is constructed to enhance robustness while preserving effective invisibility. The multiple two-dimensional information is embedded into the host image using an undecimated dual-tree complex wavelet transform with bidiagonal singular value decomposition. In this way, the extracted two-dimensional information is automatically overlapped to generate the three-dimensional text watermark according to the optimal matching parameters. Experiments on a standard image-processing dataset show that the peak signal-to-noise ratio before and after embedding the watermarks is higher than 39 dB, and the peak signal-to-noise ratio between the original and extracted watermarks is more than 37 dB, which is superior to currently reported watermark models. The proposed model therefore shows good invisibility and robustness and can minimize the loss of watermark information during transmission and attack.
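A minimal sketch of the evaluation metric cited in the abstract: peak signal-to-noise ratio (PSNR) between a host image and its watermarked version, for 8-bit images. This is the standard formula, not the authors' code.

```python
import numpy as np

def psnr(original: np.ndarray, watermarked: np.ndarray, peak: float = 255.0) -> float:
    mse = np.mean((original.astype(np.float64) - watermarked.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```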

17.
The automated classification of texts into predefined categories has witnessed booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to class imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection during the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
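A minimal sketch of a Metropolis search over a binary inclusion/exclusion vector: propose flipping one feature's bit and accept with the usual Metropolis rule. Here `log_score` stands in for the model's log-posterior of a feature subset given the training data; it is an assumed callable, not something defined in the abstract.

```python
import math
import random

def metropolis_select(n_features, log_score, n_steps=1000):
    z = [random.random() < 0.5 for _ in range(n_features)]   # inclusion bits
    current = log_score(z)
    for _ in range(n_steps):
        j = random.randrange(n_features)
        z[j] = not z[j]                       # propose flipping one bit
        proposed = log_score(z)
        if math.log(random.random()) >= proposed - current:
            z[j] = not z[j]                   # reject: undo the flip
        else:
            current = proposed                # accept the proposal
    return z
```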

18.
Automated legal text classification is a prominent research topic in the legal field and lays the foundation for building intelligent legal systems. The current literature focuses on international legal texts, such as Chinese, European, and Australian cases; little attention is paid to text classification for U.S. legal texts. Deep learning has been applied to improve text classification performance, but its effectiveness needs further exploration in domains such as the legal field. This paper investigates legal text classification on a large collection of labeled U.S. case documents by comparing the effectiveness of different text classification techniques. We propose a machine learning algorithm that uses domain concepts as features and random forests as the classifier. Our experimental results on 30,000 full U.S. case documents in 50 categories demonstrate that this approach significantly outperforms a deep learning system built on multiple pre-trained word embeddings and deep neural networks. In addition, using only the top 400 domain concepts as features for building the random forests achieves the best performance. This study provides a reference for selecting machine learning techniques to build high-performance text classification systems in the legal domain and other fields.
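A minimal sketch of the described pipeline: represent each case document by counts of a fixed vocabulary of domain concepts and classify with a random forest. The concept list, the count-based representation, and the hyperparameters are illustrative assumptions, not the paper's concept inventory.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

domain_concepts = ["negligence", "breach of contract", "due process"]  # hypothetical top concepts

vectorizer = CountVectorizer(vocabulary=domain_concepts, ngram_range=(1, 3))
clf = RandomForestClassifier(n_estimators=200, random_state=0)

def train(documents, labels):
    X = vectorizer.fit_transform(documents)   # counts of domain-concept mentions
    clf.fit(X, labels)

def predict(documents):
    return clf.predict(vectorizer.transform(documents))
```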

19.
Detecting sentiments in natural language is tricky even for humans, making automated detection more complicated still. This research proposes a hybrid deep learning model for fine-grained sentiment prediction in real-time multimodal data. It combines the strengths of deep learning networks with machine learning to deal with two specific semiotic systems, namely the textual (written text) and the visual (still images), and their combination within online content, using decision-level multimodal fusion. The proposed contextual ConvNet-SVMBoVW model has four modules: discretization, text analytics, image analytics, and decision. The input to the model is multimodal text, m ∈ {text, image, info-graphic}. The discretization module uses Google Lens to separate the text from the image; the two are then processed as discrete entities by the respective text analytics and image analytics modules. The text analytics module determines the sentiment using a convolutional neural network (ConvNet) enriched with the contextual semantics of SentiCircle, and an aggregation scheme is introduced to compute the hybrid polarity. A support vector machine (SVM) classifier trained on bag-of-visual-words (BoVW) features predicts the sentiment of the visual content. A Boolean decision module with a logical OR operation is added to the architecture; it validates and categorizes the output into five fine-grained sentiment categories (truth values): 'highly positive,' 'positive,' 'neutral,' 'negative,' and 'highly negative.' The accuracy achieved by the proposed model is nearly 91%, an improvement over the accuracy obtained by the text and image modules individually.
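A minimal sketch of decision-level fusion: map separately predicted text and image polarity scores in [-1, 1] onto the five fine-grained categories, averaging when both modalities are present and falling back to whichever is available otherwise. This is a simple stand-in for the Boolean OR-style decision module described in the abstract; the thresholds are illustrative.

```python
def to_category(score: float) -> str:
    if score > 0.5:
        return "highly positive"
    if score > 0.1:
        return "positive"
    if score >= -0.1:
        return "neutral"
    if score >= -0.5:
        return "negative"
    return "highly negative"

def fuse(text_score=None, image_score=None) -> str:
    scores = [s for s in (text_score, image_score) if s is not None]
    if not scores:
        return "neutral"   # no modality produced a prediction
    return to_category(sum(scores) / len(scores))
```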

20.
This paper proposes a text classification method that constructs a decision tree based on cloud theory and neural networks. A cloud neural network is used to learn the cloud mapping relations between variables, from which a cloud decision tree is generated. The method combines the learning algorithm of neural networks with the inference mechanism of decision trees: it has the learning ability of a neural network while applying the cloud generator's capacity to handle uncertainty. It is closer to the human way of thinking and thereby further improves the efficiency, accuracy, and reliability of text classification.
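A minimal sketch of a forward normal cloud generator, the uncertainty-handling component from cloud theory mentioned in the abstract: each drop is drawn from parameters expectation Ex, entropy En, and hyper-entropy He, and carries a certainty degree. The parameters and drop count are illustrative; how the abstract's cloud neural network uses such drops is not reproduced here.

```python
import math
import random

def normal_cloud(ex: float, en: float, he: float, n_drops: int = 100):
    drops = []
    for _ in range(n_drops):
        en_prime = random.gauss(en, he)          # perturbed entropy
        x = random.gauss(ex, abs(en_prime))      # cloud drop position
        mu = math.exp(-((x - ex) ** 2) / (2 * en_prime ** 2 + 1e-12))
        drops.append((x, mu))                    # drop and its certainty degree
    return drops
```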
