Text-Guided Object Detector for Multi-modal Video Question Answering

Ruoyue Shen; Nakamasa Inoue; Koichi Shinoda

doi:10.1109/WACV56688.2023.00109

Publication Information

Title

Japanese:
English:	Text-Guided Object Detector for Multi-modal Video Question Answering

Authors

Japanese:	Ruoyue Shen, 井上中順, 篠田浩一.
English:	Ruoyue Shen, Nakamasa Inoue, Koichi Shinoda.

Language

English

Journal/Book Title

Japanese:
English:	Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

Volume, Issue, Pages

pp. 1032-1042

Publication Date

January 2023

Publisher

Japanese:
English:	IEEE

Conference Name

Japanese:
English:	IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023

Venue

Japanese:	ハワイ
English:	Hawaii

Official Link

https://wacv2023.thecvf.com/home

DOI

https://doi.org/10.1109/WACV56688.2023.00109

Abstract

Video Question Answering (Video QA) is the task of answering a text-format question based on an understanding of linguistic semantics, visual information, and linguistic-visual alignment in the video. In Video QA, an object detector pre-trained on large-scale datasets, such as Faster R-CNN, has been widely used to extract visual representations from video frames. However, it is not always able to precisely detect the objects needed to answer the question because of the domain gaps between the datasets used to train the object detector and those used for Video QA. In this paper, we propose a text-guided object detector (TGOD), which takes text question-answer pairs and video frames as inputs, detects the objects relevant to the given text, and thus provides intuitive visualization and interpretable results. Our experiments using the STAGE framework on the TVQA+ dataset show the effectiveness of our proposed detector. It achieves a 2.02-point improvement in QA accuracy, a 12.13-point improvement in object detection (mAP50), a 1.1-point improvement in temporal location, and a 2.52-point improvement in ASA over the original STAGE detector.
