Similar Documents
20 similar documents found (search time: 31 ms)
1.
Question answering websites are becoming an ever more popular knowledge-sharing platform. On such websites, people may ask any type of question and then wait for someone else to answer it. However, askers may not obtain correct answers from appropriate experts this way. Recently, various approaches have been proposed to automatically find experts on question answering websites. In this paper, we propose a novel hybrid approach to effectively find experts for the category of the target question. Our approach considers user subject relevance, user reputation and the authority of a category in finding experts. A user's subject relevance denotes the relevance of the user's domain knowledge to the target question. A user's reputation is derived from the user's historical question-answering records, while user authority is derived from link analysis. Moreover, we extend the proposed approach into a question-dependent approach that considers the relevance of historical questions to the target question when deriving user domain knowledge, reputation and authority. We used a dataset obtained from Yahoo! Answers Taiwan to evaluate our approach. Our experimental results show that the proposed methods outperform conventional methods.
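The hybrid scoring idea described above can be sketched as a weighted combination of the three evidence sources. This is a minimal illustration only: the linear combination, default weights, and toy candidate data are assumptions, not the authors' actual model.

```python
def expert_score(relevance, reputation, authority, weights=(0.5, 0.3, 0.2)):
    """Combine subject relevance, reputation and authority into one score.

    All three inputs are assumed normalized to [0, 1]; the linear form and
    the default weights are illustrative, not taken from the paper.
    """
    w_rel, w_rep, w_auth = weights
    return w_rel * relevance + w_rep * reputation + w_auth * authority

def rank_experts(candidates):
    """candidates: dict mapping user -> (relevance, reputation, authority)."""
    return sorted(candidates, key=lambda u: expert_score(*candidates[u]), reverse=True)

# Hypothetical answerers in one category.
candidates = {
    "alice": (0.9, 0.8, 0.6),   # strong subject match, good answer history
    "bob":   (0.4, 0.9, 0.9),   # authoritative but off-topic
    "carol": (0.2, 0.3, 0.4),
}
```

A question-dependent variant would recompute the three inputs per target question rather than per category.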

2.
Stance detection identifies a person's evaluation of a subject and is a crucial component of many downstream applications. In practice, stance detection requires training a machine learning model on an annotated dataset and applying the model to another to predict the stances of text snippets. This cross-dataset model generalization poses three central questions, which we investigate using stance classification models on 7 publicly available English Twitter datasets ranging from 297 to 48,284 instances. (1) Are stance classification models generalizable across datasets? We construct a single-dataset model to train and test dataset-against-dataset, finding that models do not generalize well (avg F1 = 0.33). (2) Can we improve generalizability by aggregating datasets? We find that a multi-dataset model built on the aggregation of datasets performs better (avg F1 = 0.69). (3) Given a model built on multiple datasets, how much additional data is required to fine-tune it? We find it challenging to ascertain a minimum number of data points due to the lack of a clear pattern in performance. Investigating possible reasons for the erratic model performance, we find that texts are not easily differentiable by stance, nor are annotations consistent within and across datasets. Our observations emphasize the need for an aggregated dataset as well as consistent labels for model generalizability.
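The dataset-against-dataset protocol in question (1) can be sketched as follows. A majority-class stub stands in for the real classifier, and the macro-F1 metric is implemented directly; the toy datasets and label set are assumptions for illustration.

```python
from collections import Counter

def macro_f1(gold, pred, labels=("favor", "against", "none")):
    """Macro-averaged F1: mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def majority_baseline(train_labels, test_labels):
    """Train on one dataset, evaluate on another, with a majority-class
    stub standing in for the actual stance classifier."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return macro_f1(test_labels, [majority] * len(test_labels))

# Dataset-against-dataset grid over two toy label distributions.
ds_a = ["favor"] * 6 + ["against"] * 2
ds_b = ["against"] * 5 + ["favor"] * 3
grid = {(i, j): majority_baseline(tr, te)
        for i, tr in enumerate([ds_a, ds_b])
        for j, te in enumerate([ds_a, ds_b])}
```

The in-dataset cells of `grid` score higher than the cross-dataset cells, mirroring the generalization gap the abstract reports.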

3.
Visual Question Answering (VQA) systems have achieved great success in general scenarios. In the medical domain, VQA systems are still in their infancy, as the datasets are limited in scale and application scenarios. Current medical VQA datasets are designed to conduct basic analyses of medical imaging, such as modalities, planes, organ systems and abnormalities, aiming to provide constructive medical suggestions for doctors; they contain a large number of professional terms of limited help to patients. In this paper, we introduce a new Patient-oriented Visual Question Answering (P-VQA) dataset, which builds a VQA system for patients by covering an entire treatment process, including medical consultation, imaging diagnosis, clinical diagnosis, treatment advice and review. P-VQA covers 20 common diseases with 2,169 medical images, 24,800 question-answering pairs, and a medical knowledge graph containing 419 entities. In terms of methodology, we propose a Medical Knowledge-Based VQA Network (MKBN) to answer questions according to the images and the medical knowledge graph in P-VQA. MKBN learns two cluster embeddings (disease-related and relation-related) according to the structural characteristics of the medical knowledge graph, and learns three interactive features (image-question, image-disease and question-relation) according to the characteristics of diagnosis. For comparison, we evaluate several state-of-the-art baselines on the P-VQA dataset as benchmarks. Experimental results on P-VQA demonstrate that MKBN achieves state-of-the-art performance compared with baseline methods. The dataset is available at https://github.com/cs-jerhuang/P-VQA.

4.
OCR errors in text harm information retrieval performance. Much research has been reported on modelling and correcting Optical Character Recognition (OCR) errors. Most of the prior work employs language-dependent resources or training texts to study the nature of errors. However, little research has focused on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval performance from an erroneous corpus when no training samples are available to model the errors. Our method automatically identifies erroneous term variants in the noisy corpus, in the absence of clean text, and uses them for query expansion, which leads to improved retrieval performance. We employ an effective combination of contextual information and string matching techniques. The approach uses no training data and no language-specific resources such as a thesaurus for identifying error variants, and it assumes no knowledge about the language except that the word delimiter is the blank space. We have tested our approach on erroneous Bangla (Bengali) and Hindi FIRE collections, as well as on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora. Our approach achieves statistically significant improvements over state-of-the-art baselines on most of the datasets.
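The string-matching half of this idea can be sketched with a plain edit-distance scan over the corpus vocabulary; the contextual-information component the abstract combines it with is omitted here, and the thresholds and toy vocabulary are assumptions.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def error_variants(term, vocabulary, max_dist=1, min_len=4):
    """Collect likely OCR variants of a query term: words in the noisy
    corpus vocabulary within a small edit distance of the term.
    Short terms are skipped to avoid spurious matches."""
    if len(term) < min_len:
        return []
    return sorted(w for w in vocabulary
                  if w != term and edit_distance(term, w) <= max_dist)

# Toy noisy-corpus vocabulary with l->1 and i->1 OCR confusions.
vocab = {"retrieval", "retrieva1", "retr1eval", "query", "quary"}
```

The variants returned for a query term would then be OR-ed into the query as expansion terms.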

5.
In this era, the proliferating role of social media in our lives has popularized the posting of short texts. Short texts contain limited context and have unique characteristics that make them difficult to handle. Every day, billions of short texts are produced in the form of tags, keywords, tweets, phone messages, messenger conversations, social network posts, etc. The analysis of these short texts is imperative in the fields of text mining and content analysis. Extracting precise topics from large-scale short-text collections is a critical and challenging task. Conventional approaches fail to obtain word co-occurrence patterns in topics due to the sparsity of short texts, such as text over the web, social media posts, and news headlines. In this paper, we ameliorate the sparsity problem by presenting a novel fuzzy topic modeling (FTM) approach for short text. Local and global term frequencies are computed through a bag-of-words (BOW) model. To remove the negative impact of high dimensionality on global term weighting, principal component analysis is adopted; thereafter, the fuzzy c-means algorithm is employed to retrieve semantically relevant topics from the documents. Experiments are conducted on three real-world short-text datasets: the snippets dataset is small, whereas the other two, Twitter and questions, are larger. Experimental results show that the proposed approach discovers topics more precisely and performs better than state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx+LDA. The performance of FTM is also demonstrated in classification, clustering, topic coherence and execution time. FTM's classification accuracy on the snippets dataset is 0.95, 0.94, 0.91, 0.89 and 0.87 with 50, 75, 100, 125 and 200 topics, respectively; on the questions dataset it is 0.73, 0.74, 0.70, 0.68 and 0.78. Both are higher than the accuracies of the state-of-the-art baseline topic models.
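The clustering stage of the FTM pipeline can be sketched with a minimal fuzzy c-means on 1-D data. In the paper the inputs would be PCA-reduced term vectors; the 1-D toy data, deterministic initialization, and iteration count here are assumptions for illustration (c must be at least 2).

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Minimal fuzzy c-means on 1-D data: returns (centers, memberships).

    m is the fuzzifier; memberships[i][k] is the degree to which point i
    belongs to cluster k (each row sums to 1). Centers are initialized
    deterministically by spreading them over the sorted data.
    """
    sp = sorted(points)
    centers = [sp[i * (len(sp) - 1) // (c - 1)] for i in range(c)]
    u = [[0.0] * c for _ in points]
    for _ in range(iters):
        # Update memberships from current centers.
        for i, x in enumerate(points):
            d = [abs(x - ck) + 1e-12 for ck in centers]
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / d[j]) ** (2 / (m - 1)) for j in range(c))
        # Update centers as membership-weighted means.
        for k in range(c):
            den = sum(u[i][k] ** m for i in range(len(points)))
            centers[k] = sum((u[i][k] ** m) * x for i, x in enumerate(points)) / den
    return centers, u

# Two obvious 1-D "topic" groups.
data = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
centers, u = fuzzy_c_means(data, c=2)
```

Unlike hard k-means, every document retains a graded membership in every topic, which is what lets FTM assign semantically related topics to sparse texts.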

6.
7.
Question Answering (QA) systems are developed to answer human questions. In this paper, we propose a framework for answering definitional and factoid questions, enriched by machine learning and evolutionary methods and integrated into a web-based QA system. Our main purpose is to build new features by combining state-of-the-art features with arithmetic operators. To accomplish this goal, we present a Genetic Programming (GP)-based approach. The role of GP is to find the most promising formulas, built from a set of features and operators, that can accurately rank paragraphs, sentences, and words. We have also developed a QA system to test the new features. The input of our system is the text of documents retrieved by a search engine. To answer definitional questions, our system performs paragraph ranking and returns the most related paragraph. To answer factoid questions, the system evaluates the sentences of the filtered paragraphs ranked by the previous module of our framework. After this phase, the system extracts one or more words from the ranked sentences based on a set of hand-made patterns and ranks them to find the final answer. We have used Text REtrieval Conference (TREC) QA track questions, web data, and the AQUAINT and AQUAINT-2 datasets for training and testing our system. Results show that the learned features produce better rankings than other evaluation formulas.

8.
Adversarial training is effective for training robust image classification models. To improve robustness, existing approaches often use many propagations to generate adversarial examples, which is time-consuming. In this work, we propose an efficient adversarial training method with loss-guided propagation (ATLGP) to accelerate the adversarial training process. ATLGP takes the loss value of generated adversarial examples as guidance to control the number of propagations for each training instance at different training stages, which reduces computation while preserving the strength of the generated adversarial examples. In this way, our method achieves comparable robustness in less time than traditional training methods. It also generalizes well and can easily be combined with other efficient training methods. We conduct comprehensive experiments on CIFAR10 and MNIST, two standard benchmark datasets. The experimental results show that ATLGP reduces training time by 30% to 60% compared with baseline methods while achieving similar robustness against various adversarial attacks. The combination of ATLGP and ATTA (an efficient adversarial training method) achieves superior acceleration when high robustness is required. The propagation statistics at different training stages and ablation studies prove the effectiveness of applying loss-guided propagation to each training instance. The acceleration technique makes it easier to extend adversarial training methods to large-scale datasets and to more diverse model architectures such as vision transformers.
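The loss-guidance idea can be sketched as a per-instance step scheduler: spend more attack propagations on examples whose adversarial loss is still low, and truncate the budget for examples that are already strong. The threshold and step values below are illustrative assumptions, not the paper's settings.

```python
def guided_steps(adv_loss, thresholds=(0.5, 1.5, 3.0), steps=(10, 6, 3, 1)):
    """Map the current adversarial loss to a propagation budget.

    Low loss -> the example is still weak -> more propagations;
    high loss -> it is already adversarial enough -> stop early.
    """
    for t, s in zip(thresholds, steps):
        if adv_loss < t:
            return s
    return steps[-1]

def attack_budget(losses):
    """Total propagations spent on a batch under loss-guided scheduling."""
    return sum(guided_steps(loss) for loss in losses)

# Early in training most examples still have low adversarial loss
# (full budget); late in training many are already hard (truncated budget).
early_losses = [0.2, 0.3, 0.4, 0.1]
late_losses = [2.7, 3.5, 1.8, 4.0]
```

In a full implementation the budget would bound the inner PGD loop per instance, which is where the 30-60% time saving would come from.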

9.
Real-world datasets often present different types of data quality problems, such as the presence of outliers, missing values, inaccurate representations and duplicate entities. To identify duplicate entities, a task named Entity Resolution (ER), we may employ a variety of classification techniques. Rule-based classification techniques have gained increasing attention in the state of the art due to the possibility of incorporating automatic learning approaches for generating Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) generating high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) RbER algorithm parameters cannot be tuned; iii) user preferences regarding the ER results cannot be incorporated into the algorithm's functioning; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out on both real-world and synthetic datasets demonstrate the ability of the proposed approach to enhance the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not entail a significant overhead over the classification step, and we conclude that the Rule Assembler parameters APA, WPA, TβM and Max are the most suitable for practical scenarios.

10.
While image-to-image translation has been extensively studied, existing methods have a number of limitations when transforming between instances of different shapes from different domains. In this paper, we propose a novel approach (hereafter referred to as ObjectVariedGAN) to handle geometric translation. One may encounter large and significant shape changes during image-to-image translation, especially object transfiguration. We therefore focus on synthesizing the desired results so as to maintain the shape of the foreground object without requiring paired training data. Specifically, our approach learns the mapping between source and target domains where the shapes of objects differ significantly. A feature similarity loss is introduced to encourage generative adversarial networks (GANs) to capture the structural attributes of objects (e.g., object segmentation masks). Additionally, to satisfy the requirement of utilizing unaligned datasets, cycle-consistency loss is combined with a context-preserving loss. Our approach feeds the generator with source images, incorporating the instance segmentation mask, and guides the network to generate the desired target-domain output. To verify the effectiveness of the proposed approach, extensive experiments are conducted on pre-processed examples from the MS-COCO dataset. A comparative summary of the findings demonstrates that ObjectVariedGAN outperforms other competing approaches in terms of Inception Score, Fréchet Inception Distance, and human cognitive preference.

11.
12.
One of the main tasks of question answering services such as Sina iAsk (新浪爱问) and Baidu Zhidao (百度知道) is to classify questions, in order to organize the question data generated by users and support further analysis and processing. The practical requirements of question answering services place high demands on question classification algorithms in terms of classification quality, computational complexity, and sensitivity to noisy data. Drawing on ideas from information retrieval, this paper proposes a classification algorithm based on class-document ranking, and analyzes and improves the algorithm from a language-modeling perspective. A series of experiments on a large-scale collection of question data shows that the proposed algorithm achieves better classification results than traditional algorithms on the question classification task. At the same time, the algorithm is computationally light and suitable for processing large-scale data, so it satisfies the requirements that question answering services impose on question classification algorithms.
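The class-document-ranking idea can be sketched as follows: pool each category's questions into one "class document" and rank categories by a simple unigram language-model score against the new question. The add-one smoothing and toy category data are assumptions; the paper's improved language-model variant is not reproduced here.

```python
import math
from collections import Counter

def category_scores(question_tokens, category_docs):
    """Rank categories by scoring each pooled class document against the
    question with an add-one-smoothed unigram language model.

    Returns (best_category, all_scores)."""
    scores = {}
    for cat, tokens in category_docs.items():
        tf = Counter(tokens)
        n = len(tokens)
        # Add-one smoothing so unseen question words do not zero the score.
        scores[cat] = sum(math.log((tf[w] + 1) / (n + len(tf)))
                          for w in question_tokens)
    return max(scores, key=scores.get), scores

# Hypothetical class documents: all past questions of a category, pooled.
category_docs = {
    "sports": "game team score win match player league goal".split(),
    "tech": "computer software bug install windows phone app error".split(),
}
best, scores = category_scores("my phone app shows an error".split(), category_docs)
```

Because scoring is a single pass over the class documents, the cost grows with the number of categories rather than the number of historical questions, which matches the abstract's emphasis on low computational cost.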

13.
This paper addresses the blog distillation problem: given a user query, find the blogs most related to the query topic. We model each post as evidence of the relevance of a blog to the query, and use aggregation methods such as Ordered Weighted Averaging (OWA) operators to combine the evidence. We show that using only highly relevant evidence (posts) for each blog can result in an effective retrieval system. We also take into account the importance of the posts in a query-based cluster and investigate its effect on the aggregation results. We use prioritized OWA operators and show that considering importance is effective when the number of aggregated posts from each blog is high. We carry out our experiments on three different data sets (TREC07, TREC08 and TREC09) and show statistically significant improvements over the state-of-the-art voting model.
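An OWA operator sorts the evidence before weighting it, so the weights attach to rank positions rather than to particular posts. The sketch below aggregates only the top-k post scores per blog; the weight vector, k, and toy scores are illustrative assumptions.

```python
def owa(values, weights):
    """Ordered Weighted Averaging: sort the evidence in descending order,
    then take the dot product with the weight vector (weights sum to 1)."""
    assert len(values) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * v for w, v in zip(weights, sorted(values, reverse=True)))

def blog_score(post_scores, k=3):
    """Aggregate a blog's top-k post relevance scores with an OWA operator
    that emphasizes the strongest evidence."""
    top = sorted(post_scores, reverse=True)[:k]
    weights = [0.5, 0.3, 0.2][: len(top)]
    total = sum(weights)  # renormalize if the blog has fewer than k posts
    return owa(top, [w / total for w in weights])

# A blog with a few highly relevant posts beats one with many mediocre posts.
focused = [0.9, 0.85, 0.1, 0.05]
diffuse = [0.5, 0.5, 0.5, 0.5, 0.5]
```

Setting the weight vector to `[1, 0, 0]` recovers max-aggregation, and uniform weights recover the mean, which is why OWA is a convenient family for tuning how much of each blog's evidence to trust.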

14.
Climate change has become one of the most significant crises of our time. Public opinion on climate change is influenced by social media platforms such as Twitter, and is often divided into believers and deniers. In this paper, we propose a framework to classify a tweet's stance on climate change (denier/believer). Existing approaches to stance detection and classification of climate change tweets either pay little attention to the characteristics of deniers' tweets or lack an appropriate architecture. The relevant literature reveals that the sentiment and time perspective of climate change conversations on Twitter have a major impact on public attitudes and environmental orientation. Therefore, in our study, we explore the role of temporal orientation and sentiment analysis (auxiliary tasks) in detecting the stance of tweets on climate change (the main task). Our proposed framework STASY integrates word- and sentence-based feature encoders with intra-task and shared-private attention frameworks to better encode the interactions between task-specific and shared features. We conducted our experiments on our novel curated climate change CLiCS dataset (2,465 denier and 7,235 believer tweets), two publicly available climate change datasets (ClimateICWSM-2022 and ClimateStance-2022), and two benchmark stance detection datasets (SemEval-2016 and COVID-19-Stance). Experiments show that our proposed approach improves stance detection performance over the baseline methods by benefiting from the auxiliary tasks, with average F1 improvements of 12.14% on our climate change dataset, 15.18% on ClimateICWSM-2022, 12.94% on ClimateStance-2022, 19.38% on SemEval-2016, and 35.01% on COVID-19-Stance.

15.
The aim in multi-label text classification is to assign a set of labels to a given document. Previous classifier-chain and sequence-to-sequence models have been shown to have a powerful ability to capture label correlations. However, they rely heavily on label order, while the labels in multi-label data are essentially an unordered set. The performance of these approaches therefore varies greatly depending on the order in which the labels are arranged. To avoid dependence on label order, we design a reasoning-based algorithm named Multi-Label Reasoner (ML-Reasoner) for multi-label classification. ML-Reasoner employs a binary classifier to predict all labels simultaneously and applies a novel iterative reasoning mechanism to effectively exploit inter-label information, where each round of reasoning takes the previously predicted likelihoods for all labels as additional input. This approach is able to utilize information between labels while avoiding label-order sensitivity. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on the challenging AAPD dataset. We also apply our reasoning module to a variety of strong neural base models and show that it boosts performance significantly in each case.
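The iterative reasoning loop can be sketched as follows: every round re-predicts all label likelihoods at once, feeding the previous round's likelihoods back as extra input. The toy co-occurrence rule below stands in for the neural classifier; the label count, feedback weight, and round count are assumptions.

```python
def reason(features, base_predict, n_labels=3, rounds=3):
    """Iterative reasoning sketch (order-free): each round predicts all
    label likelihoods simultaneously from the text features plus the
    previous round's likelihoods."""
    likelihoods = [0.5] * n_labels  # uninformative start
    for _ in range(rounds):
        likelihoods = base_predict(features, likelihoods)
    return likelihoods

def toy_predict(features, prev):
    """Toy stand-in for the learned classifier: label 2 becomes more likely
    when label 0 is likely (a hand-coded inter-label dependency)."""
    out = list(features)  # base scores from the "text" alone
    out[2] = min(1.0, features[2] + 0.4 * prev[0])
    return out

base = [0.9, 0.1, 0.2]  # label 0 clearly present, label 2 borderline
final = reason(base, toy_predict)
```

Because all labels are updated in parallel each round, no label ordering is ever imposed, which is the property that removes the order sensitivity of classifier chains.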

16.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, study of the same issue for ranking appears to be insufficient. As pointed out in this paper, it is inappropriate to extend techniques for classification to ranking, and the development of novel techniques is sorely needed. To begin with, we propose the concept of "pairwise preference consistency" (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure of queries and documents, both of which are unique properties of ranking. We then select a subset of the original training documents by maximizing the PPC of the selected subset, and we propose an efficient solution to this maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.
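A simplified reading of the selection objective can be sketched as follows: measure how often document pairs' feature-score order agrees with their relevance-label order, then greedily drop the documents whose removal most improves that consistency. This ignores the query/document hierarchy in the paper's full PPC definition, and the greedy search is a stand-in for the paper's efficient solution; the toy data is an assumption.

```python
from itertools import combinations

def pairwise_consistency(docs):
    """Fraction of document pairs whose score order agrees with their
    relevance-label order. docs: list of (label, score) tuples."""
    consistent = total = 0
    for (la, sa), (lb, sb) in combinations(docs, 2):
        if la == lb:
            continue  # ties carry no ordinal preference
        total += 1
        if (la > lb) == (sa > sb):
            consistent += 1
    return consistent / total if total else 1.0

def select_subset(docs, k):
    """Greedily remove the document whose removal most improves pairwise
    consistency, until k documents remain."""
    docs = list(docs)
    while len(docs) > k:
        best = max(range(len(docs)),
                   key=lambda i: pairwise_consistency(docs[:i] + docs[i + 1:]))
        docs.pop(best)
    return docs

# Three well-behaved documents plus one noisy label (high label, low score).
docs = [(2, 0.9), (1, 0.4), (0, 0.1), (2, 0.05)]
```

The noisy document is exactly the kind of mislabeled training point the selection step is meant to filter out before the ranker is trained.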

17.
The International Classification of Diseases (ICD) is a type of metadata found in many Electronic Patient Records. Research exploring the utility of these codes in medical Information Retrieval (IR) applications is new, and many areas of investigation remain, including the question of how reliable the assignment of the codes has been. This paper proposes two uses of ICD codes in two different search contexts: Pseudo-Relevance Judgments (PRJ) and Pseudo-Relevance Feedback (PRF). We find that our approach to evaluating TREC challenge runs using simulated relevance judgments correlates positively with the official TREC results, and that our proposed technique for performing PRF based on ICD codes significantly outperforms a traditional PRF approach. The results are consistent over the two years of queries from the TREC medical test collection.

18.
Opinion mining is one of the most important research tasks in the information retrieval community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information, hypothesizing that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate an opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them a reference collection, then use language models to determine opinion scores. The analyzed document and the reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries; in our study, we adapt them to represent opinionated documents. We carry out several experiments using the Text REtrieval Conference (TREC) Blogs 06 collection as our analysis collection, and the Internet Movie Database (IMDb), Multi-Perspective Question Answering (MPQA) and CHESLY corpora as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and the reference collection, alongside different combinations of opinion and retrieval scores, and deduce the best opinion detection models. Using the best models, our approach improves on the best TREC Blog baseline (baseline4) by 30%.
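One of the smoothed language models named above, Jelinek-Mercer, can be sketched as an opinion scorer: score a document under a unigram model of the opinionated reference collection, backed off to a background collection model. The mixing weight and toy corpora are assumptions, and this is a simplification of the paper's scoring, not its exact formula.

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram model over a token list."""
    tf = Counter(tokens)
    n = len(tokens)
    return lambda w: tf[w] / n

def jm_opinion_score(doc, ref, background, lam=0.7):
    """Length-normalized log-likelihood of a document under a
    Jelinek-Mercer mixture of the opinionated reference model and a
    background model: higher = wording closer to the subjective sources."""
    p_ref, p_bg = unigram_lm(ref), unigram_lm(background)
    logp = 0.0
    for w in doc:
        p = lam * p_ref(w) + (1 - lam) * p_bg(w)
        logp += math.log(p) if p > 0 else math.log(1e-12)  # floor unseen words
    return logp / len(doc)

# Toy corpora: subjective reference vs. neutral background.
opinion_ref = "i love this film it is wonderful terrible awful i hate it".split()
background = "the film was released in cinemas this year the plot".split()
opinionated = "i love this wonderful film".split()
factual = "the film was released this year".split()
```

Swapping the mixture for Dirichlet or two-stage smoothing changes only the `p` expression, which is how the abstract's model comparison would be carried out.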

19.
Relation extraction aims to find meaningful relationships between two named entities in unstructured text. In this paper, we cast information extraction as a matrix completion problem, employing the notion of universal schemas formed from a collection of patterns derived from open information extraction systems, together with additional features derived from grammatical clause patterns and statistical topic models. One challenge with earlier work that employs matrix completion methods is that such approaches require a sufficient number of observed relation instances to make predictions; in practice, there is often insufficient explicit evidence supporting each relation type within the matrix model, so existing work suffers from low recall. We extend the state of the art by proposing novel ways of integrating two sets of features, topic models and grammatical clause structures, to alleviate the low-recall problem. More specifically, we propose to (1) employ grammatical clause information from textual sentences as an implicit indication of relation-type and argument similarity, on the basis that similar relation types and arguments are likely to be observed within similar grammatical structures; and (2) leverage statistical topic models to determine the similarity between relation types and arguments based on their co-occurrence within the same topics. We have performed extensive experiments on both gold-standard and silver-standard datasets. The experiments show that our approach addresses the low-recall problem of existing methods, improving recall by 21% and f-measure by 8% over the state-of-the-art baseline.

20.
Brain–computer interface (BCI) is a promising intelligent healthcare technology for improving quality of life across the lifespan: it enables assistance with movement and communication, rehabilitation of motor function and nerves, and monitoring of sleep quality, fatigue and emotion. Most BCI systems are based on motor imagery electroencephalogram (MI-EEG) signals because of advantages such as independence from the sensory organs and operation at free will. However, MI-EEG classification, a core problem in BCI systems, suffers from two critical challenges: the temporal non-stationarity of the EEG signal and the nonuniform distribution of information over the different electrode channels. To address these two challenges, this paper proposes TCACNet, a temporal and channel attention convolutional network for MI-EEG classification. TCACNet leverages a novel attention mechanism module and a well-designed network architecture to process the EEG signals. The former enables TCACNet to pay more attention to signals from task-related time slices and electrode channels, supporting the latter in making accurate classification decisions. We compare the proposed TCACNet with other state-of-the-art deep learning baselines on two open-source EEG datasets. Experimental results show that TCACNet achieves 11.4% and 7.9% classification accuracy improvements on the two datasets, respectively. Additionally, TCACNet achieves the same accuracy as other baselines with about 50% less training data. In terms of classification accuracy and data efficiency, the superiority of TCACNet over advanced baselines demonstrates its practical value for BCI systems.


Copyright © Beijing Qinyun Technology Development Co., Ltd. (京ICP备09084417号)