
Paper / Book Information


Title: ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering
Authors: Ruoyue Shen, Nakamasa Inoue, Dayan Guan, Rizhao Cai, Alex C. Kot, Koichi Shinoda
Language: English
Journal / Book title: IEEE Transactions on Multimedia (Early Access)
Volume, Number, Pages: pp. 1-14
Publication date: February 17, 2025
Publisher: IEEE
Conference name: (none)
Venue: (none)
Official link: https://ieeexplore.ieee.org/document/10891469
DOI: https://doi.org/10.1109/TMM.2025.3543043
Abstract: Visual Question Answering (VQA) presents a challenging task at the intersection of computer vision and natural language processing, aiming to bridge the semantic gap between visual perception and linguistic comprehension. Traditional VQA approaches do not distinguish between data processing and reasoning, limiting their interpretability and generalizability in complex and diverse scenarios. Conversely, Programmatic Visual Question Answering (PVQA) models leverage large language models (LLMs) to generate executable code, providing answers with detailed and interpretable reasoning processes. However, existing PVQA models typically rely on simplistic input-output prompting, which struggles to elicit domain-specific knowledge from LLMs and often produces unclear or extraneous outputs. Furthermore, PVQA models typically rely on a basic in-context example (ICE) selection methodology that is heavily influenced by individual word similarity rather than the overall sentence context. This leads to suboptimal ICE selection and a reliance on dataset-specific ICE candidates. In this paper, we propose ContextualCoder, a novel prompting framework tailored for PVQA models. ContextualCoder leverages frozen LLMs for code generation and pre-trained visual models for code execution, eliminating the need for extensive training and enhancing model flexibility. By incorporating an innovative prompting methodology and a novel ICE selection strategy, ContextualCoder facilitates the use of diverse in-context information for code generation, thereby improving the performance of PVQA models. Our approach surpasses state-of-the-art models, as evidenced by comprehensive experiments across diverse VQA datasets, including multilingual scenarios.
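The abstract highlights an ICE selection strategy driven by whole-sentence context rather than individual word similarity. The paper itself defines the actual method; purely as an illustrative sketch (not the authors' implementation), sentence-level ICE selection could look roughly like the Python below, where the function names, the choice of embedding model, and the cosine-similarity scoring are assumptions made for this example.

# Hypothetical sketch of sentence-level in-context example (ICE) selection.
# Names, the embedding model, and the scoring are illustrative assumptions,
# not the ContextualCoder implementation described in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer  # any sentence encoder would do

def select_ices(query: str, candidates: list[str], k: int = 4,
                model_name: str = "all-MiniLM-L6-v2") -> list[str]:
    """Rank candidate ICEs by cosine similarity of whole-sentence embeddings
    (rather than individual word overlap) and return the top-k."""
    model = SentenceTransformer(model_name)
    # Encode the query question and all candidate questions as single vectors.
    vecs = model.encode([query] + candidates, normalize_embeddings=True)
    query_vec, cand_vecs = vecs[0], vecs[1:]
    # With normalized vectors, cosine similarity reduces to a dot product.
    scores = cand_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [candidates[i] for i in top]

if __name__ == "__main__":
    pool = [
        "How many red cars are parked on the left side of the street?",
        "What color is the umbrella held by the woman?",
        "Is the dog sitting on the sofa or on the floor?",
    ]
    print(select_ices("How many blue bicycles are in the image?", pool, k=2))

Scoring a single embedding of the whole question captures sentence-level context, whereas word-overlap matching can latch onto incidental tokens and pick ICEs that do not reflect the question's actual intent.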
