Similar literature
20 similar documents found (search time: 31 ms)
1.
Visual Question Answering (VQA) requires reasoning about the visually grounded relations in the image and question context. A crucial aspect of solving complex questions is reliable multi-hop reasoning, i.e., dynamically learning the interplay between visual entities at each step. In this paper, we investigate the potential of the reasoning graph network on multi-hop reasoning questions, especially those over 3 “hops.” We call this model QMRGT: A Question-Guided Multi-hop Reasoning Graph Network. It constructs a cross-modal interaction module (CIM) and a multi-hop reasoning graph network (MRGT) and infers an answer by dynamically updating the inter-associated instruction between the two modalities. Our graph reasoning module can be applied to any multi-modal model. Experiments on the VQA 2.0 and GQA datasets (in fully supervised and O.O.D. settings) show that both QMRGT and pre-trained V&L models+MRGT lead to improvements on visual question answering tasks. Graph-based multi-hop reasoning provides an effective signal for the visual question answering challenge, for both O.O.D. and high-level reasoning questions.

2.
Effectively detecting supportive knowledge of answers is a fundamental step towards automated question answering. While pre-trained semantic vectors for texts have enabled semantic computation for background-answer pairs, they are limited in representing structured knowledge relevant for question answering. Recent studies have shown interest in incorporating structured knowledge graphs into text processing; however, their focus was more on semantics than on graph structure. This study, by contrast, takes a special interest in exploring the structural patterns of knowledge graphs. Inspired by human cognitive processes, we propose novel methods of feature extraction for capturing the local and global structural information of knowledge graphs. These features not only exhibit good indicative power, but can also facilitate text analysis with explainable meanings. Moreover, aiming to better combine structural and semantic evidence for prediction, we propose a Neural Knowledge Graph Evaluator (NKGE), which showed superior performance over existing methods. Our contributions include a novel set of interpretable structural features and the effective NKGE for compatibility evaluation between knowledge graphs. The methods of feature extraction and the structural patterns indicated by the features may also provide insights for related studies in computational modeling and processing of knowledge.

3.
This paper focuses on temporal retrieval of activities in videos via sentence queries. Given a sentence query describing an activity, temporal moment retrieval aims at localizing the temporal segment within the video that best matches the textual query. This is a general yet challenging task, as it requires comprehension of both video and language. Existing research predominantly employs coarse frame-level features as the visual representation, obscuring the specific details (e.g., the desired objects “girl”, “cup” and action “pour”) within the video which may provide critical cues for localizing the desired moment. In this paper, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) approach to resolve those issues. Specifically, the SLTF method first takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features “girl”, “cup”) by spatial attention. Then we encode the sequence of the local features on consecutive frames by employing an LSTM network, which can capture the motion information and interactions among these objects (e.g., the interaction “pour” involving these two objects). Meanwhile, language-temporal attention is utilized to emphasize the keywords based on moment context information. Thereafter, a tensor fusion network learns both the intra-modality and inter-modality dynamics, which can enhance the learning of the moment-query representation. As a result, our proposed two attention sub-networks can adaptively recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query for retrieving the desired moment. Experimental results on three public benchmark datasets (obtained from TACOS, Charades-STA, and DiDeMo) show that the SLTF model significantly outperforms current state-of-the-art approaches, and demonstrate the benefits produced by the new technologies incorporated into SLTF.

4.
The rapid development of online social media makes Abusive Language Detection (ALD) a hot topic in the field of affective computing. However, most methods for ALD in social networks do not take into account the interactive relationships among user posts and simply regard ALD as a task of text context representation learning. To solve this problem, we propose a pipeline approach that considers both the context of a post and the characteristics of the interaction network in which it is posted. Specifically, our method is divided into pre-training and downstream tasks. First, to capture fine contextual features of the posts, we use Bidirectional Encoder Representation from Transformers (BERT) as the encoder to generate sentence representations. Next, we build a Relation-Special Network according to the semantic similarity between posts as well as the structural information of the interaction network. On this basis, we design a Relation-Special Graph Neural Network (RSGNN) to spread effective information in the interaction network and learn the classification of texts. Experiments on three public datasets show that our method effectively improves the detection of abusive posts. The results demonstrate that injecting interaction network structure into the abusive language detection task can significantly improve the detection results.

5.
Recent studies point out that VQA models tend to rely on the language prior in the training data to answer the questions, which prevents the VQA model from generalizing to out-of-distribution test data. To address this problem, approaches have been designed to reduce the language distribution prior effect by constructing negative image–question pairs, but they cannot provide the proper visual reason for answering the question. In this paper, we present a new debiasing framework for VQA by Learning to Sample paired image–question and Prompt for given question (LSP). Specifically, we construct the negative image–question pairs with a certain sampling rate to prevent the model from overly relying on the visual shortcut content. Notably, question types provide a strong hint for answering the questions. We utilize question type to constrain the sampling process for negative question–image pairs, and further learn the question type-guided prompt for better question comprehension. Extensive experiments on two public benchmarks, VQA-CP v2 and VQA v2, demonstrate that our model achieves new state-of-the-art results in overall accuracy, i.e., 61.95% and 65.26%.

6.
Recommender systems, as an effective means of reducing information overload, have been widely used in the e-commerce field. Existing studies mainly capture semantic features by considering user-item interactions or behavioral history records, ignoring the sparsity of interactions and the drift of user preferences. To cope with these challenges, we introduce the recently popular Graph Neural Networks (GNN) and propose an Interest Evolution-driven Gated Neighborhood (IEGN) aggregation representation model which can capture accurate user representations and track the evolution of user interests. Specifically, in IEGN, we explicitly model the relational information between neighbor nodes by introducing a gated adaptive propagation mechanism. Then, a personalized time interval function is designed to track the evolution of user interests. In addition, a high-order convolutional pooling operation is used to capture the correlation among the short-term interaction sequence. The user preferences are predicted by the fusion of user dynamic preferences and short-term interaction features. Extensive experiments on Amazon and Alibaba datasets show that IEGN outperforms several state-of-the-art methods in recommendation tasks.

7.
Question categorization, which suggests one of a set of predefined categories for a user’s question according to the question’s topic or content, is a useful technique in user-interactive question answering systems. In this paper, we propose an automatic method for question categorization in a user-interactive question answering system. This method includes four steps: feature space construction, topic-wise word identification and weighting, semantic mapping, and similarity calculation. We first construct the feature space based on all accumulated questions and calculate the feature vector of each predefined category, which contains certain accumulated questions. When a new question is posted, the semantic pattern of the question is used to identify and weight the important words of the question. After that, the question is semantically mapped into the constructed feature space to enrich its representation. Finally, the similarity between the question and each category is calculated based on their feature vectors. The category with the highest similarity is assigned to the question. The experimental results show that our proposed method achieves good categorization precision and outperforms the traditional categorization methods on the selected test questions.
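The final similarity step described above can be sketched as follows. This is only an illustration under simplifying assumptions: it uses plain bag-of-words counts and cosine similarity, and it omits the semantic-pattern word weighting and the semantic mapping steps, whose details the abstract does not give. The function names and the toy categories are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map lowercased whitespace tokens to their counts (assumed feature space)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(question, categories):
    """Return the category whose accumulated questions are most similar."""
    q_vec = bag_of_words(question)
    # One feature vector per category, built from its accumulated questions.
    cat_vecs = {name: bag_of_words(" ".join(qs)) for name, qs in categories.items()}
    return max(cat_vecs, key=lambda name: cosine(q_vec, cat_vecs[name]))


# Hypothetical accumulated questions per predefined category.
categories = {
    "cooking": ["how long should i boil an egg", "what oil is best for frying"],
    "travel": ["which visa do i need for japan", "how early should i arrive at the airport"],
}

print(categorize("best oil for frying an egg", categories))
```

Raw counts let common function words dominate the score, which is presumably one reason the method weights topic-wise words before comparing vectors.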

8.
Machine reading comprehension (MRC) is a challenging task in the field of artificial intelligence. Most existing MRC works contain a semantic matching module, either explicitly or intrinsically, to determine whether a piece of context answers a question. However, there is scant work which systematically evaluates different paradigms using semantic matching in MRC. In this paper, we conduct a systematic empirical study on semantic matching. We formulate a two-stage framework which consists of a semantic matching model and a reading model, based on pre-trained language models. We compare and analyze the effectiveness and efficiency of using semantic matching modules with different setups on four types of MRC datasets. We verify that using semantic matching before a reading model improves both the effectiveness and efficiency of MRC. Compared with answering questions by extracting information from concise context, we observe that semantic matching yields more improvements for answering questions with noisy and adversarial context. Matching coarse-grained context to questions, e.g., paragraphs, is more effective than matching fine-grained context, e.g., sentences and spans. We also find that semantic matching is helpful for answering who/where/when/what/how/which questions, whereas it decreases the MRC performance on why questions. This may imply that semantic matching helps to answer a question whose necessary information can be retrieved from a single sentence. The above observations demonstrate the advantages and disadvantages of using semantic matching in different scenarios.

9.
Abnormal event detection in videos plays an essential role in public security. However, most weakly supervised learning methods ignore the relationship between the complicated spatial correlations and the dynamic trends of temporal patterns in video data. In this paper, we provide a new perspective, i.e., spatial similarity and temporal consistency are adopted to construct Spatio-Temporal Graph-based CNNs (STGCNs). For feature extraction, we use Inflated 3D (I3D) convolutional networks to extract features which can better capture appearance and motion dynamics in videos. For the spatial graph and the temporal graph, each video segment is regarded as a vertex in the graph, and an attention mechanism is introduced to allocate attention to each segment. For the spatial-temporal fusion graph, we propose a self-adapting weighting to fuse them. Finally, we build a ranking loss and a classification loss to improve the robustness of STGCNs. We evaluate the performance of STGCNs on the UCF-Crime dataset (128 h in total) and the ShanghaiTech dataset (317,398 frames in total), obtaining AUC scores of 84.2% and 92.3%, respectively. The experimental results also show the effectiveness and robustness with other evaluation metrics.

10.
As an information medium, video offers many possible retrieval and browsing modalities, far more than text, image or audio. Some of these, like searching the text of the spoken dialogue, are well developed, others like keyframe browsing tools are in their infancy, and others are not yet technically achievable. For those browsing and retrieval modalities which we cannot yet achieve, we can only speculate as to how useful they will actually be. In our work we have created a system to support multiple modalities for video browsing and retrieval including text search through the spoken dialogue, image matching against shot keyframes and object matching against segmented video objects. For the last of these, automatic segmentation and tracking of video objects is a computationally demanding problem which is not yet solved for generic natural video material; once it is solved, it is expected to open up possibilities for user interaction with objects in video, including searching and browsing. In this paper we achieve object segmentation by working in a closed domain of animated cartoons. We describe an interactive user experiment on a medium-sized corpus of video where we were able to measure users’ use of video objects versus other modes of retrieval during multiple-iteration searching. Results of this experiment show that although object searching is used far less than text searching in the first iteration of a user’s search, it is a popular and useful search type once an initial set of relevant shots has been found.

11.
Answer selection is the most complex phase of a question answering (QA) system. To solve this task, typical approaches use unsupervised methods such as computing the similarity between query and answer, optionally exploiting advanced syntactic, semantic or logic representations.

12.
Graph neural networks (GNNs) have shown great potential for personalized recommendation. At the core is to reorganize interaction data as a user-item bipartite graph and exploit high-order connectivity among user and item nodes to enrich their representations. While achieving great success, most existing works consider the interaction graph based only on ID information, foregoing item contents from multiple modalities (e.g., visual, acoustic, and textual features of micro-video items). Distinguishing personal interests on different modalities at a granular level was not explored until the recently proposed MMGCN (Wei et al., 2019). However, it simply employs GNNs on parallel interaction graphs and treats information propagated from all neighbors equally, failing to capture user preference adaptively. Hence, the obtained representations might preserve redundant, even noisy information, leading to non-robustness and suboptimal performance. In this work, we aim to investigate how to adopt GNNs on multimodal interaction graphs, to adaptively capture user preference on different modalities and offer in-depth analysis on why an item is suitable to a user. Towards this end, we propose a new Multimodal Graph Attention Network, MGAT for short, which disentangles personal interests at the granularity of modality. In particular, built upon multimodal interaction graphs, MGAT conducts information propagation within individual graphs, while leveraging the gated attention mechanism to identify varying importance scores of different modalities to user preference. As such, it is able to capture more complex interaction patterns hidden in user behaviors and provide a more accurate recommendation. Empirical results on two micro-video recommendation datasets, Tiktok and MovieLens, show that MGAT exhibits substantial improvements over the state-of-the-art baselines like NGCF (Wang, He, et al., 2019) and MMGCN (Wei et al., 2019). Further analysis on a case study illustrates how MGAT generates attentive information flow over multimodal interaction graphs.

13.
The problem of content-based video retrieval continues to pose a challenge to the research community, the performance of video retrieval systems being low due to the semantic gap. In this paper we consider whether taking advantage of context can aid the video retrieval process by making the prediction of relevance easier, i.e. if it is easier for a classification system to predict the relevance of a video shot under a given context, then that context also has the potential to improve retrieval, since the underlying features better differentiate relevant from non-relevant video shots. We use an operational definition of context, where datasets can be split into disjoint sub-collections which reflect a particular context. Contexts considered include task difficulty and user expertise, among others. In the classification process, four main types of features are used to represent video shots: conventional low-level visual features representing physical properties of the video shots, behavioral features which are based on user interaction with the video shots, and two different bag-of-words features obtained from Automatic Speech Recognition applied to the audio of the video.

14.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as those on Twitter, which often contain language irregularity and noise, and do not necessarily contain as much semantic information as longer clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN) model, combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which creates an informative semantic representation of each sentence by (1) using a CNN to extract the local region information in the form of important n-grams from the sentence, and (2) applying an RNN to capture the long-term dependency information. In addition, we perform a comparative study on state-of-the-art approaches within paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation has shown that the proposed DeepParaphrase-based approach achieves good results on both types of texts, thus making it more robust and generic than the existing approaches.

15.
Since meta-paths have the innate ability to capture rich structure and semantic information, meta-path-based recommendations have gained tremendous attention in recent years. However, how should these multi-dimensional meta-paths be composed? How should their dynamic characteristics be modeled? How can their priority and importance be learned automatically to capture users' diverse and personalized preferences at the user-level granularity? These issues are pivotal yet challenging for improving both the performance and the interpretability of recommendations. To address these challenges, we propose a personalized recommendation method via Multi-Dimensional Meta-Paths Temporal Graph Probabilistic Spreading (MD-MP-TGPS). Specifically, we first construct temporal multi-dimensional graphs with full consideration of the interest drift of users, obsolescence and popularity of items, and dynamic update of interaction behavior data. Then we propose a dimension-free temporal graph probabilistic spreading framework via multi-dimensional meta-paths. Moreover, to automatically learn the priority and importance of these multi-dimensional meta-paths at the user-level granularity, we propose two boosting strategies for personalized recommendation. Finally, we conduct comprehensive experiments on two real-world datasets, and the experimental results show that the proposed MD-MP-TGPS method outperforms the compared state-of-the-art methods on performance indicators such as precision, recall, F1-score, Hamming distance, intra-list diversity and popularity, in terms of accuracy, diversity, and novelty.

16.
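来云 (Lai Yun). 《现代情报》 (Journal of Modern Information), 2017, 37(11): 121-124
The intelligent reference question answering robot is an important type of intelligent library robot. System design is the primary research topic, and corpus technology is the core factor in its service effectiveness. This paper studies the system design and corpus technology of intelligent library reference question answering robots, covering the system design scheme, the construction and sources of the question corpus and the answer corpus, classification types, the classification and expansion of corpus questions, and personalized analysis and processing. The study offers a reference for comprehensive research on intelligent library reference question answering robots.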

17.
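王日花 (Wang Rihua). 《情报科学》 (Information Science), 2021, 39(10): 76-87
[Purpose/Significance] To address the high cost of dataset construction when building automatic question answering systems, and the limitation that automatic question answering considers only the relevance of the question or the answer itself. [Method/Process] A dataset construction method that fuses an annotated question-answer base with community question answering data is proposed; a multi-layer heterogeneous network model of question keywords, questions, answers, and answer clusters is built; and an automatic question answering algorithm based on this model is given. Library corpora were collected and processed as experimental data, and the BERT-Cos, AINN, and BiMPM models were used as baselines for experiments and analysis. [Results/Conclusions] The experiments yield the performance of each model on the library automatic question answering task; the proposed model outperforms the other models on every evaluation metric, with an accuracy of 87.85%. [Innovation/Limitations] The proposed multi-source fused dataset construction method and automatic question answering model perform better on question answering tasks than existing methods, and, based on an analysis of model performance, a recommendation on the word length of user questions is given.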

18.
Optimal answerer ranking for new questions in community question answering (total citations: 1; self-citations: 1; citations by others: 0)
Community question answering (CQA) services that enable users to ask and answer questions have become popular on the internet. However, many new questions often cannot be effectively resolved by appropriate answerers. To address this question routing task, in this paper, we treat it as a ranking problem and rank the potential answerers by the probability that they are able to solve the given new question. We utilize a tensor model and a topic model simultaneously to extract latent semantic relations among asker, question and answerer. Then, we propose a learning procedure based on the above models to obtain the optimal ranking of answerers for new questions by optimizing the multi-class AUC (Area Under the ROC Curve). Experimental results on two real-world CQA datasets show that the proposed method is able to predict appropriate answerers for new questions and outperforms other state-of-the-art approaches.
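To make the ranking objective concrete, the sketch below computes the ordinary binary AUC of a ranked list of candidate answerers: the fraction of (suitable, unsuitable) answerer pairs in which the suitable one receives the higher score. It is an illustration of the metric only, not the paper's multi-class AUC or its learning procedure; the function name, scores, and labels are hypothetical.

```python
def auc(scores, labels):
    """Binary AUC via pairwise comparison; ties count as half a win.

    scores: predicted suitability of each candidate answerer.
    labels: 1 if the answerer could actually resolve the question, else 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Four hypothetical candidate answerers for one new question.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because AUC depends only on the relative order of scores, it fits the ranking formulation: any monotone rescaling of the predicted probabilities leaves it unchanged.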

19.
Parsing a human image to obtain the text labels corresponding to the human body is a critical task for human-computer interaction. Although previous methods have significantly improved parsing performance, the problems of parsing confusion and missed tiny targets remain unresolved, which leads to erroneous and incomplete inference. To address these drawbacks, we fuse semantic and spatial features to mine the human body information based on the Dual Pyramid Unit convolutional neural network, named DPUNet. DPUNet is composed of a Context Pyramid Unit (CPU) and a Spatial Pyramid Unit (SPU). First, we design the CPU to aggregate local-to-global semantic information, producing the semantic feature used to eliminate semantic confusion. To capture tiny targets and prevent details from being missed, the SPU is proposed to incorporate multi-scale spatial information and output the spatial feature. Finally, the features of the two complementary units are fused for accurate and complete human parsing results. Our approach outperforms the state-of-the-art methods on single-human and multiple-human parsing datasets. Meanwhile, the proposed framework is efficient, running at 41.2 fps.

20.
The human skeleton, as a compact representation of action, has attracted considerable research attention in recent years. However, skeletal data is too sparse to fully characterize fine-grained human motions, especially hand/finger motions with subtle local movements. Besides, because it contains no information about the objects being interacted with, skeletal data alone makes it hard to identify human–object interaction actions accurately. Hence, many action recognition approaches that rely purely on skeletal data have met a bottleneck in identifying such actions. In this paper, we propose an Informed Patch Enhanced HyperGraph Convolutional Network that jointly employs the human pose skeleton and informed visual patches for multi-modal feature learning. Specifically, we extract five informed visual patches around the head, left hand, right hand, left foot and right foot joints as the complementary visual graph vertices. These patches often exhibit rich action-related semantic information, like facial expressions, hand gestures, and objects interacting with hands or feet, which can compensate for the deficiency of skeletal data. This hybrid scheme can boost the performance while keeping the computation and memory load low, since only five extra vertices are appended to the original graph. Evaluation on two widely used large-scale datasets for skeleton-based action recognition demonstrates the effectiveness of the proposed method compared to the state-of-the-art methods. Significant accuracy improvements are reported using the X-Sub protocol on the NTU RGB+D 120 dataset.

