Similar Documents
20 similar documents retrieved (search time: 578 ms)
1.
The rapid growth of documents in different languages, the increased accessibility of electronic documents, and the availability of translation tools have brought increasing attention to cross-lingual plagiarism detection in recent years. The task of cross-language plagiarism detection entails two main steps: candidate retrieval and assessing pairwise document similarity. In this paper we examine candidate retrieval, where the goal is to find potential source documents of a suspicious text. Our proposed method for cross-language plagiarism detection is a keyword-focused approach. Since plagiarism usually occurs in parts of a text, the texts must be segmented into fragments to detect local similarity. We therefore propose a topic-based segmentation algorithm that converts the suspicious document into a set of related passages. We then use a proximity-based model to retrieve documents with the best-matching passages. Experiments show promising results for this important phase of cross-language plagiarism detection.
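The candidate-retrieval step above hinges on topic-based segmentation of the suspicious document. A minimal lexical-cohesion sketch of that idea follows; the bag-of-words cosine cue and the `threshold` value are illustrative assumptions, not the paper's actual algorithm.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def topic_segments(sentences, threshold=0.1):
    """Greedy topic segmentation: start a new passage whenever lexical
    cohesion with the previous sentence drops below `threshold`."""
    segments, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        sim = cosine(Counter(prev.lower().split()), Counter(sent.lower().split()))
        if sim < threshold:
            segments.append(current)
            current = [sent]
        else:
            current.append(sent)
    segments.append(current)
    return segments

docs = [
    "plagiarism detection compares documents",
    "plagiarism detection uses similarity measures",
    "neural networks learn representations",
]
print(topic_segments(docs))  # two passages: the topic shifts at the third sentence
```

Each resulting passage, rather than the whole document, would then be matched against candidate sources with the proximity-based model.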

2.
Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases. These models, however, require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy considers the notions of paraphrases and non-paraphrases as binary relations over the set of texts. It then uses graph-theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks with and without soft-attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. Our model is both wide and deep and provides greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it achieves comparable or state-of-the-art performance on all three.
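The graph-theoretic augmentation described above can be sketched as follows: treat "is a paraphrase of" as (approximately) transitive, take the closure within each connected component, and propagate each labeled non-paraphrase edge to every pair across the two components. This is a minimal sketch under that transitivity assumption; the union-find bookkeeping and example sentence IDs are mine, not the paper's.

```python
from itertools import combinations

def augment(paraphrase_pairs, non_paraphrase_pairs):
    """Generate extra labeled pairs from connected components of the
    paraphrase graph (assumes paraphrase is an equivalence relation)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in paraphrase_pairs:
        union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)

    # All intra-component pairs are paraphrases (transitive closure).
    pos = {frozenset(p) for g in groups.values() for p in combinations(sorted(g), 2)}
    # A non-paraphrase edge separates its two whole components.
    neg = set()
    for a, b in non_paraphrase_pairs:
        ga = groups.get(find(a), {a})
        gb = groups.get(find(b), {b})
        neg |= {frozenset((x, y)) for x in ga for y in gb}
    return pos, neg
```

For example, labeled pairs (s1, s2) and (s2, s3) yield the new positive (s1, s3), and one negative edge (s1, t1) yields negatives for s2 and s3 against t1 as well.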

3.
The proposed work explores and compares the power of syntactic-semantic linguistic structures for plagiarism detection using natural language processing techniques. The current work exploits linguistic features, viz. part-of-speech tags, chunks, and semantic roles, to detect plagiarized fragments, and employs a combined syntactic-semantic similarity metric that extracts semantic concepts from the WordNet lexical database. The linguistic information supports effective pre-processing and semantically relevant comparisons. Another major contribution is the analysis of the proposed approach on plagiarism cases of various complexity levels: the impact of plagiarism types and complexity levels on the extracted features is analyzed and discussed. Further, unlike existing systems, which were evaluated on limited data sets, the proposed approach is evaluated at a larger scale using the plagiarism corpora provided by the PAN competitions from 2009 to 2014. The approach shows considerable improvement over the top-ranked systems of the respective years. The evaluation and analysis of various cases of plagiarism also reflect the superiority of deeper linguistic features for identifying manually plagiarized text.

4.
Effectively detecting supportive knowledge for answers is a fundamental step towards automated question answering. While pre-trained semantic vectors for texts have enabled semantic computation for background-answer pairs, they are limited in representing structured knowledge relevant to question answering. Recent studies have shown interest in incorporating structured knowledge graphs into text processing; however, their focus was more on semantics than on graph structure. This study, by contrast, takes a special interest in exploring the structural patterns of knowledge graphs. Inspired by human cognitive processes, we propose novel feature extraction methods for capturing the local and global structural information of knowledge graphs. These features not only exhibit good indicative power, but can also facilitate text analysis with explainable meanings. Moreover, aiming to better combine structural and semantic evidence for prediction, we propose a Neural Knowledge Graph Evaluator (NKGE), which shows superior performance over existing methods. Our contributions include a novel set of interpretable structural features and the effective NKGE for compatibility evaluation between knowledge graphs. The feature extraction methods and the structural patterns indicated by the features may also provide insights for related studies in the computational modeling and processing of knowledge.
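To make "local structural features" concrete, here is a toy sketch computing two classic node-level statistics (degree and local clustering) over an adjacency-dict graph. The abstract does not specify its feature set, so these particular features and the example graph are illustrative assumptions only.

```python
def local_features(graph, node):
    """Toy local structural features for a node of an undirected
    knowledge graph given as {node: set_of_neighbors}."""
    neighbors = graph.get(node, set())
    degree = len(neighbors)
    # Count edges among the node's neighbors (each closed triangle once).
    links = sum(1 for a in neighbors
                  for b in graph.get(a, set()) if b in neighbors) // 2
    possible = degree * (degree - 1) / 2
    clustering = links / possible if possible else 0.0
    return {"degree": degree, "clustering": clustering}

# Hypothetical mini knowledge graph (entities only, relations dropped).
kg = {
    "Paris": {"France", "Seine", "Louvre"},
    "France": {"Paris", "Seine"},
    "Seine": {"Paris", "France"},
    "Louvre": {"Paris"},
}
print(local_features(kg, "Paris"))
```

Feature vectors like this, concatenated per entity pair, are the kind of interpretable structural evidence a model such as NKGE could combine with semantic vectors.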

5.
Image–text matching is a crucial branch of multimedia retrieval that relies on learning inter-modal correspondences. Most existing methods focus on global or local correspondence and fail to explore fine-grained global–local alignment. Moreover, the issue of how to infer more accurate similarity scores remains unresolved. In this study, we propose a novel unifying knowledge iterative dissemination and relational reconstruction (KIDRR) network for image–text matching. In particular, the knowledge graph iterative dissemination module is designed to iteratively broadcast global semantic knowledge, enabling relevant nodes to be associated and yielding fine-grained intra-modal correlations and features. Hence, vector-based similarity representations are learned from multiple perspectives to model multi-level alignments comprehensively. The relation graph reconstruction module is further developed to enhance cross-modal correspondences by constructing similarity relation graphs and adaptively reconstructing them. We conducted experiments on the Flickr30K and MSCOCO datasets, which contain 31,783 and 123,287 images, respectively. Experiments show that KIDRR improves Recall@1 by nearly 2.2% and 1.6% on Flickr30K and MSCOCO, respectively, compared to the current state-of-the-art baselines.

6.
Knowledge graphs are widely used in retrieval systems, question answering (QA) systems, hypothesis generation systems, etc. Representation learning provides a way to mine knowledge graphs for missing relations, and translation-based embedding models are a popular form of representation model. The shortcomings of translation-based models, however, limit their practicality as knowledge completion algorithms; the proposed model helps to address some of these shortcomings. The similarity between the graph structural features of two entities was found to be correlated with the relations of those entities, and this correlation can help to solve the problems caused by unbalanced relations and reciprocal relations. We used Node2vec, a graph embedding algorithm, to represent information about an entity's graph structure, and we introduce a cascade model that incorporates graph embedding and knowledge embedding into a unified framework. The cascade model first refines feature representations in the first two stages (Local Optimization Stage), and then uses backward propagation to optimize the parameters of all stages (Global Optimization Stage). This enhances the knowledge representation of existing translation-based algorithms by taking into account both semantic features and graph features and fusing them to extract more useful information. In addition, different cascade structures are designed to find the optimal solution to the problem of knowledge inference and retrieval. The proposed model was verified on three mainstream knowledge graphs: WN18, FB15k and BioChem. Experimental results were validated on the entity prediction task using the Hits@10 rate. The proposed model performed better than TransE, giving an average improvement of 2.7% on WN18, 2.3% on FB15k and 28% on BioChem. Improvements were particularly marked where there were problems with unbalanced relations and reciprocal relations. Furthermore, the stepwise-cascade structure proves more effective and significantly outperforms the other baselines.
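The TransE baseline mentioned above scores a triple (h, r, t) by how closely the relation vector translates the head embedding onto the tail: score = -||h + r - t||. A minimal sketch of that scoring function follows; the 3-d embeddings are made up for illustration (in practice they are learned).

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: negative L2 norm of h + r - t.
    Scores closer to 0 indicate a more plausible triple."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical 3-d embeddings; real ones come from training on the KG.
emb = {
    "paris":  [1.0, 0.0, 0.0],
    "france": [1.0, 1.0, 0.0],
    "berlin": [0.0, 0.0, 1.0],
}
capital_of = [0.0, 1.0, 0.0]

# (paris, capital_of, france) should outscore (berlin, capital_of, france).
print(transe_score(emb["paris"], capital_of, emb["france"]))
print(transe_score(emb["berlin"], capital_of, emb["france"]))
```

The cascade model in the abstract augments exactly this kind of knowledge embedding with Node2vec graph-structure features before scoring.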

7.
Nowadays, stress has become a growing problem for society due to its high impact on individuals, but also on health care systems and companies. To overcome this problem, early detection of stress is a key factor. Previous studies have shown the effectiveness of text analysis in the detection of sentiment, emotion, and mental illness. However, existing solutions for stress detection from text are focused on a specific corpus, and there is still a lack of well-validated methods that provide good results on different datasets. We aim to advance the state of the art by proposing a method to detect stress in textual data and evaluating it using multiple public English datasets. The proposed approach combines lexicon-based features with distributional representations to enhance classification performance. To help organize features for stress detection in text, we propose a lexicon-based feature framework that exploits affective, syntactic, social, and topic-related features. In addition, three different word embedding techniques are studied to exploit distributional representations. Our approach has been implemented with three machine learning models whose performance has been evaluated through several experiments. This evaluation has been conducted on three public English datasets and provides a baseline for other researchers. The obtained results identify the combination of FastText embeddings with a selection of lexicon-based features as the best-performing model, achieving F-scores above 80%.
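The feature combination described above can be sketched as concatenating hand-built lexicon features with a distributional sentence vector before classification. This is a minimal illustration only: the tiny stress lexicon, the two features chosen, and the pre-computed embedding are all assumptions, not the paper's framework.

```python
# Toy stress lexicon (assumed; the paper uses established affective lexicons).
STRESS_LEXICON = {"overwhelmed", "anxious", "deadline", "pressure", "exhausted"}

def lexicon_features(text):
    """Two toy lexicon-based features: stress-word rate and exclamation count."""
    tokens = text.lower().split()
    stress_hits = sum(t.strip(".,!?") in STRESS_LEXICON for t in tokens)
    exclaims = text.count("!")
    return [stress_hits / max(len(tokens), 1), exclaims]

def combine(text, embedding):
    """Concatenate lexicon features with a distributional representation
    (here a pre-computed sentence embedding) for the downstream classifier."""
    return lexicon_features(text) + list(embedding)

vec = combine("So anxious about this deadline!", [0.12, -0.40, 0.33])
print(vec)
```

The concatenated vector would then feed one of the three machine learning classifiers evaluated in the paper.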

8.
This paper presents a formalism for the representation of complex semantic relations among concepts of natural language. We define a semantic algebra as a set of atomic concepts together with an ordered set of semantic relations. Semantic trees are a graphical representation of a semantic algebra (comparable to Kantorovic trees for boolean or arithmetical expressions). A semantic tree is an ordered tree whose nodes are labeled with relation and concept names. We generate semantic trees from natural language texts in such a way that they represent the semantic relations that hold among the concepts occurring within the text. This generation process is carried out by a transformational grammar which directly transforms natural language sentences into semantic trees. We present an example of concepts and relations within the domain of computer science, where we have generated semantic trees from definition texts by means of a metalanguage for transformational grammars (a sort of metacompiler for transformational grammars). The semantic trees generated so far serve as thesaurus entries in an information retrieval system.

9.
Among the various forms of academic misconduct, text recycling or 'self-plagiarism' holds a particularly contentious position as a new way to game the reward system of science. A recent case of alleged 'self-plagiarism' by the prominent Dutch economist Peter Nijkamp attracted much public and regulatory attention in the Netherlands. During the Nijkamp controversy, it became evident that many questions around text recycling have only partly been answered and that much uncertainty still exists. While the conditions of fair text reuse have been specified more clearly in the wake of this case, the extent and causes of problematic text recycling remain unclear. In this study, we investigated the extent of problematic text recycling in order to understand its occurrence in four research areas: biochemistry & molecular biology, economics, history and psychology. We also investigated potential reasons and motives for authors to recycle their text, by testing current hypotheses in the scholarly literature on the causes of text recycling. To this end, we analyzed 922 journal articles using the Turnitin plagiarism detection software, followed by close manual interpretation of the results. We observed considerable levels of problematic text recycling, particularly in economics and psychology, and found that the extent of text recycling varies substantially between research fields. We also found evidence that more productive authors are more likely to recycle their papers. Finally, the analysis provides insight into the influence of the number of authors and of editorial policies on the occurrence of problematic text recycling.

10.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as Twitter, which often contain language irregularity and noise, and do not necessarily contain as much semantic information as longer, clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN), combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using the CNN to extract local region information in the form of important n-grams from the sentence, and (2) applying the RNN to capture long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation shows that the proposed DeepParaphrase-based approach achieves good results on both types of texts, making it more robust and generic than existing approaches.
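The fine-grained word-level matching component above typically starts from a pairwise word-similarity matrix between the two sentences. A minimal sketch of building such a matrix is below; the 2-d toy embeddings are assumptions for illustration, not the paper's vectors.

```python
import math

def similarity_matrix(sent1, sent2, vectors):
    """Pairwise cosine similarities between the words of two sentences,
    the raw input of a fine-grained word-level matching model."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.hypot(*u) * math.hypot(*v)
        return num / den if den else 0.0
    return [[round(cos(vectors[w1], vectors[w2]), 2) for w2 in sent2.split()]
            for w1 in sent1.split()]

# Hypothetical 2-d word embeddings.
toy_vectors = {"cheap": [1.0, 0.1], "inexpensive": [0.9, 0.2], "flight": [0.1, 1.0]}
print(similarity_matrix("cheap flight", "inexpensive flight", toy_vectors))
```

In an architecture like DeepParaphrase, this matrix would be consumed alongside the CNN/RNN sentence representations rather than classified on its own.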

11.
[Purpose/Significance] The mutual integration of knowledge from different disciplines is the basis for the formation and development of interdisciplinary research fields. Understanding the knowledge integration process of interdisciplinary fields helps to promote knowledge discovery and knowledge innovation in those fields. [Method/Process] Using natural language processing techniques, this paper extracts knowledge phrases from the bibliographic records, citation contexts and reference texts of interdisciplinary publications and, taking citation relations and lexical matching as clues, proposes methods for identifying absorbed knowledge and internalized knowledge, thereby revealing the "knowledge absorption - knowledge internalization" integration process in interdisciplinary fields. [Result/Conclusion] Taking the eHealth field as an example, the paper reveals the overall characteristics of knowledge internalization in that field and the differences in how knowledge supplied by different source disciplines is internalized. [Innovation/Limitations] By drawing on citation content, this paper quantitatively analyzes the knowledge internalization process of an interdisciplinary field and proposes a novel full-text quantitative analysis method, offering a new perspective and analytical approach for understanding how knowledge grows in interdisciplinary fields. Future research could refine this method using the full text of references and conduct empirical studies in more interdisciplinary research fields.

12.
Irony as a literary technique is widely used in online texts such as Twitter posts. Accurate irony detection is crucial for tasks such as effective sentiment analysis. A text's ironic intent is defined by its context incongruity. For example in the phrase "I love being ignored", the irony is defined by the incongruity between the positive word "love" and the negative context of "being ignored". Existing studies mostly formulate irony detection as a standard supervised learning text categorization task, relying on explicit expressions for detecting context incongruity. In this paper we formulate irony detection instead as a transfer learning task where supervised learning on irony labeled text is enriched with knowledge transferred from external sentiment analysis resources. Importantly, we focus on identifying the hidden, implicit incongruity without relying on explicit incongruity expressions, as in "I like to think of myself as a broken down Justin Bieber – my philosophy professor." We propose three transfer learning-based approaches to using sentiment knowledge to improve the attention mechanism of recurrent neural models for capturing hidden patterns for incongruity. Our main findings are: (1) Using sentiment knowledge from external resources is a very effective approach to improving irony detection; (2) For detecting implicit incongruity, transferring deep sentiment features seems to be the most effective way. Experiments show that our proposed models outperform state-of-the-art neural models for irony detection.
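The explicit-incongruity baseline that the paper moves beyond can be sketched as checking for opposite sentiment polarities co-occurring in a sentence, as in the "I love being ignored" example above. This is a toy illustration only: the two-word lexicons and the half-split heuristic are assumptions, and implicit incongruity (the paper's focus) would not be caught by it.

```python
POSITIVE = {"love", "great", "like"}      # toy sentiment lexicon (assumed)
NEGATIVE = {"ignored", "broken", "hate"}  # toy sentiment lexicon (assumed)

def polarity(words):
    """Sum of word polarities under the toy lexicons."""
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def explicit_incongruity(sentence):
    """Flag explicit context incongruity: the two halves of the sentence
    carry opposite sentiment polarity."""
    words = sentence.lower().split()
    half = len(words) // 2
    left, right = polarity(words[:half]), polarity(words[half:])
    return left * right < 0  # opposite signs => incongruity

print(explicit_incongruity("I love being ignored"))  # True
```

Sentences like the Justin Bieber example carry no lexicon-visible polarity clash, which is why the paper transfers deep sentiment features into the attention mechanism instead.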

13.
Plagiarism is the misuse of and failure to acknowledge source materials. This paper questions common responses to the apparent increase in plagiarism by students. Internet plagiarism occurs in a context – using the Internet as an information tool – where the relevant norms are far from obvious and models of virtue are difficult to identify and perhaps impossible to find. Ethical responses to the pervasiveness of Internet-enhanced plagiarism require a reorientation of perspective on both plagiarism and the Internet as a knowledge tool. Technological strategies to "catch the cheats" send a "don't get caught" message to students and direct the limited resources of academic institutions to a battle that cannot be won. More importantly, it is not the right battleground. Rather than characterising Internet-enabled plagiarism as a problem generated and solvable by emerging technologies, we argue that there is a more urgent need to build the background conditions that enable and sustain ethical relationships and academic virtues: to nurture an intellectual community.

14.
Cross-genre author profiling aims to build generalized models for predicting profile traits of authors that can be helpful across different text genres for computer forensics, marketing, and other applications. The cross-genre author profiling task becomes challenging when dealing with low-resourced languages due to the lack of standard corpora and methods. The task becomes even more challenging when the data is code-switched, i.e., informal and unstructured. In previous studies, the problem of cross-genre author profiling has mainly been explored for mono-lingual texts in highly resourced languages (English, Spanish, etc.). However, it has not been thoroughly explored for code-switched text, which is widely used for communication over social media. To fill this gap, we propose a transfer learning-based solution for the cross-genre author profiling task on code-switched (English–RomanUrdu) text using three widely known genres: Facebook comments/posts, Tweets, and SMS messages. In this article, we first experimented with traditional machine learning, deep learning and pre-trained transfer learning models (MBERT, XLMRoBERTa, ULMFiT, and XLNET) for the same-genre and cross-genre gender identification task. We then propose a novel Trans-Switch approach that focuses on the code-switching nature of the text and trains on specialized language models. In addition, we developed three RomanUrdu-to-English translated corpora to study the impact of translation on author profiling tasks. The results show that the proposed Trans-Switch model outperforms the baseline deep learning and pre-trained transfer learning models for the cross-genre author profiling task on code-switched text. Further, the experiments also show that translating the RomanUrdu text does not improve results.

15.
Plagiarism remains a topic of high interest to the scientific community. Among its many vicious forms, patchwork plagiarism is characterized by numerous unresolved issues and often passes "under the radar" of editors and reviewers. The problem of detecting this complex form of misconduct has been partially resolved by plagiarism detection software; however, interpreting the resulting reports is not always obvious or easy. This article deals with plagiarism in general and patchwork plagiarism in particular, as well as the related problems that editors must address to maintain the integrity of scientific journals.

16.
Application of sentence-level knowledge extraction in information science (Cited by: 3; self-citations: 0, citations by others: 3)
By comparing sentence-level knowledge extraction with word-level knowledge extraction, this paper analyzes the significance of sentence-level knowledge extraction for information science, as reflected in four typical application systems: academic plagiarism detection systems, automatic reference annotation systems, automatic literature summarization systems, and knowledge base construction systems. It analyzes the difficulties and key technologies of knowledge extraction and, in response to them, proposes three shifts for knowledge extraction: the extraction object shifts toward academic literature; the extraction technique shifts toward content-structure analysis; and the extraction goal shifts toward building knowledge-element databases.

17.
Humans are able to reason from multiple sources to arrive at the correct answer. In the context of Multiple Choice Question Answering (MCQA), knowledge graphs can provide subgraphs based on different combinations of questions and answers, mimicking the way humans find answers. However, current research mainly focuses on independent reasoning on a single graph for each question–answer pair, lacking the ability for joint reasoning among all answer candidates. In this paper, we propose a novel method KMSQA, which leverages multiple subgraphs from the large knowledge graph ConceptNet to model the comprehensive reasoning process. We further encode the knowledge graphs with shared Graph Neural Networks (GNNs) and perform joint reasoning across multiple subgraphs. We evaluate our model on two common datasets: CommonsenseQA (CSQA) and OpenBookQA (OBQA). Our method achieves an exact match score of 74.53% on CSQA and 71.80% on OBQA, outperforming all eight baselines.

18.
Knowledge graphs are sizeable graph-structured knowledge bases containing both abstract and concrete concepts in the form of entities and relations. Recently, convolutional neural networks have achieved outstanding results in producing more expressive representations of knowledge graphs. However, existing deep learning-based models exploit semantic information from single-level feature interaction, potentially limiting expressiveness. To alleviate this issue, we propose a knowledge graph embedding model with an attention-based high-low level feature interaction convolutional network, called ConvHLE. This model effectively harvests richer semantic information and generates more expressive representations. Concretely, a multilayer convolutional neural network is utilized to fuse high- and low-level features. Features in the fused feature maps then interact with other informative neighbors through a criss-cross attention mechanism, which expands the receptive fields and boosts the quality of interactions. Finally, a plausibility score function is proposed for the evaluation of our model. The performance of ConvHLE is experimentally investigated on six benchmark datasets with individual characteristics. Extensive experimental results show that ConvHLE learns more expressive and discriminative feature representations and outperforms other state-of-the-art baselines on most metrics in link prediction tasks. Comparing MRR and Hits@1 on FB15K-237, our model outperforms the baseline ConvE by 13.5% and 16.0%, respectively.
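Both this abstract and abstract 6 report link-prediction quality with MRR and Hits@k. These standard metrics are simple to compute from the rank the model assigns to the correct entity for each test triple; the example ranks below are made up for illustration.

```python
def hits_at_k(ranks, k=10):
    """Hits@k: fraction of test triples whose correct entity is ranked
    within the top k by the model's scoring function."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank of the correct entities."""
    return sum(1 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the correct entity for five test triples.
ranks = [1, 3, 2, 15, 1]
print(hits_at_k(ranks, 1), mrr(ranks))
```

Ranks are usually computed in the "filtered" setting, i.e., after removing other known-true entities from the candidate list, though the convention varies by paper.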

19.
Chen Cuicui, Kejiao Wenhui (The Science Education Article Collects), 2013, (4): 125-126
In response to the talent needs of the region's key "new electronics" industry, and keeping pace with industrial restructuring and market changes, this paper proposes a "three-tier progression, two threads throughout" training model for electronics majors, jointly designed and implemented by schools and enterprises, that emphasizes applied technology and innovation capability, so that graduates' abilities and knowledge structures can adapt better and faster to the needs of regional economic and social development.

20.
Although commonly confused, the values inherent in copyright policy differ from those inherent in scholarly standards for the proper crediting of ideas. Piracy is the infringement of a copyright; plagiarism is the failure to give credit. The increasing use of Web-based electronic publication has created new contexts for both piracy and plagiarism. Insofar as piracy and plagiarism are confused, we cannot appreciate how the Web has changed the importance of these very different types of wrongs. The present paper argues that Web-based publication lessens the importance of piracy while heightening the need for protections against plagiarism. Copyright policy protects the opportunity for publishers to profit from their investments; as the cost of publication decreases in electronic media, we need fewer copyright protections. Plagiarism is the failure to abide by scholarly standards for the citation of sources. These standards assure us that information can be verified and traced to its source. Since Web sources are often volatile and changing, it becomes both more difficult and more important to have clear standards for verifying the source of all information.
