ReGR: Relation-aware graph reasoning framework for video question answering |
| |
Institutions: | 1. Department of Information Management, Dongbei University of Finance & Economics, Dalian, Liaoning, China; 2. Department of Information Resources Management, Business School of Nankai University, Tianjin, China; 3. Center for Network Society Governance of Nankai University, Tianjin, China; 4. College of Big Data and Intelligent Engineering, Yangtze Normal University, Chongqing 408100, China; 5. Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; 6. College of Computer and Information Science, Southwest University, Chongqing 400715, China |
| |
Abstract: | As one of the more challenging cross-modal tasks, video question answering (VideoQA) aims to fully understand video content and answer relevant questions. Mainstream approaches extract appearance and motion features to characterize videos separately, ignoring the interactions between the two feature types and between each of them and the question. Moreover, some crucial semantic interaction details between visual objects are overlooked. In this paper, we propose a novel Relation-aware Graph Reasoning (ReGR) framework for video question answering, which is the first to combine appearance–motion and location–semantic interaction relations between visual objects. For the interaction between appearance and motion, we design the Appearance–Motion Block, which is question-guided to capture the interdependence between appearance and motion. For the interaction between location and semantics, we design the Location–Semantic Block, which uses a Multi-Relation Graph Attention Network to capture the geometric positions of objects and the semantic interactions between them. Finally, a question-driven Multi-Visual Fusion module produces more accurate multimodal representations. Extensive experiments on three benchmark datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the superiority of the proposed ReGR over state-of-the-art methods. |
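The abstract describes a Multi-Relation Graph Attention Network that aggregates object features under several relation graphs (e.g. a location graph and a semantic graph). The paper's actual architecture is not reproduced here; the following is only a minimal NumPy sketch of the general idea, in which every function and variable name (`multi_relation_graph_attention`, `relation_masks`, the projection matrices) is an illustrative assumption rather than the authors' implementation: attention is computed separately under each relation's adjacency mask, and the relation-specific messages are then averaged.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along one axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_relation_graph_attention(obj_feats, relation_masks, w_q, w_k, w_v):
    """Illustrative sketch (not the paper's code): scaled dot-product
    attention over object features, restricted by one 0/1 adjacency mask
    per relation type (e.g. location, semantic), then averaged."""
    q = obj_feats @ w_q                        # (N, d) queries
    k = obj_feats @ w_k                        # (N, d) keys
    v = obj_feats @ w_v                        # (N, d) values
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, N) pairwise scores
    messages = []
    for mask in relation_masks:                # one graph per relation type
        masked = np.where(mask > 0, scores, -1e9)  # attend only along edges
        attn = softmax(masked, axis=-1)
        messages.append(attn @ v)              # relation-specific message
    return np.mean(messages, axis=0)           # fuse across relation types

# Toy usage with random features and two hypothetical relation graphs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                # 4 objects, 8-dim features
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
location_graph = np.ones((4, 4))               # fully connected by location
semantic_graph = np.eye(4)                     # self-loops only, for demo
out = multi_relation_graph_attention(
    feats, [location_graph, semantic_graph], w_q, w_k, w_v)
```

The mean over relation-specific messages is the simplest fusion choice; a learned, question-guided weighting (as the question-driven fusion in the abstract suggests) would replace the final `np.mean`.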
| |
Keywords: | Video question answering; Cross-modal; Graph neural network; Interaction relation reasoning; Attention mechanism |
This article is indexed in ScienceDirect and other databases.
|