
Paper / Book Information


Title: ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering
Authors: Ruoyue Shen, Nakamasa Inoue, Dayan Guan, Rizhao Cai, Alex C. Kot, Koichi Shinoda
Language: English
Journal / Book title: IEEE Transactions on Multimedia (Early Access)
Volume, Number, Pages: pp. 1-14
Publication date: February 17, 2025
Publisher: IEEE
Conference name: (none)
Venue: (none)
Official link: https://ieeexplore.ieee.org/document/10891469
DOI: https://doi.org/10.1109/TMM.2025.3543043
Abstract: Visual Question Answering (VQA) presents a challenging task at the intersection of computer vision and natural language processing, aiming to bridge the semantic gap between visual perception and linguistic comprehension. Traditional VQA approaches do not distinguish between data processing and reasoning, limiting their interpretability and generalizability in complex and diverse scenarios. Conversely, Programmatic Visual Question Answering (PVQA) models leverage large language models (LLMs) to generate executable code, providing answers with detailed and interpretable reasoning processes. However, existing PVQA models typically rely on simplistic input-output prompting, which struggles to elicit domain-specific knowledge from LLMs and often produces unclear or extraneous outputs. Furthermore, PVQA models typically rely on a basic in-context example (ICE) selection methodology that is heavily influenced by individual word similarity rather than the overall sentence context. This leads to suboptimal ICE selection and a reliance on dataset-specific ICE candidates. In this paper, we propose ContextualCoder, a novel prompting framework tailored for PVQA models. ContextualCoder leverages frozen LLMs for code generation and pre-trained visual models for code execution, eliminating the need for extensive training and enhancing model flexibility. By incorporating an innovative prompting methodology and a novel ICE selection strategy, ContextualCoder facilitates the use of diverse in-context information for code generation, thereby improving the performance of PVQA models. Our approach surpasses state-of-the-art models, as evidenced by comprehensive experiments across diverse VQA datasets, including multilingual scenarios.
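The abstract highlights an ICE selection strategy driven by whole-sentence context rather than individual word similarity. The paper itself defines the actual method; purely as an illustrative sketch (not the authors' implementation), sentence-level ICE selection could look roughly like the Python below, where the function names, the choice of embedding model, and the cosine-similarity scoring are assumptions made for this example.

# Hypothetical sketch of sentence-level in-context example (ICE) selection.
# Names, the embedding model, and the scoring are illustrative assumptions,
# not the ContextualCoder implementation described in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer  # any sentence encoder would do

def select_ices(query: str, candidates: list[str], k: int = 4,
                model_name: str = "all-MiniLM-L6-v2") -> list[str]:
    """Rank candidate ICEs by cosine similarity of whole-sentence embeddings
    (rather than individual word overlap) and return the top-k."""
    model = SentenceTransformer(model_name)
    # Encode the query question and all candidate questions as single vectors.
    vecs = model.encode([query] + candidates, normalize_embeddings=True)
    query_vec, cand_vecs = vecs[0], vecs[1:]
    # With normalized vectors, cosine similarity reduces to a dot product.
    scores = cand_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [candidates[i] for i in top]

if __name__ == "__main__":
    pool = [
        "How many red cars are parked on the left side of the street?",
        "What color is the umbrella held by the woman?",
        "Is the dog sitting on the sofa or on the floor?",
    ]
    print(select_ices("How many blue bicycles are in the image?", pool, k=2))

Scoring a single embedding of the whole question captures sentence-level context, whereas word-overlap matching can latch onto incidental tokens and pick ICEs that do not reflect the question's actual intent.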
