
Publication Information


Title
Japanese: 
English: ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering
Author
Japanese: Ruoyue Shen, Nakamasa Inoue, Dayan Guan, Rizhao Cai, Alex C. Kot, Koichi Shinoda
English: Ruoyue Shen, Nakamasa Inoue, Dayan Guan, Rizhao Cai, Alex C. Kot, Koichi Shinoda
Language English 
Journal/Book name
Japanese: 
English: IEEE Transactions on Multimedia (Early Access)
Volume, Number, Page pp. 1-14
Published date Feb. 17, 2025 
Publisher
Japanese: 
English: IEEE
Conference name
Japanese: 
English: 
Conference site
Japanese: 
English: 
Official URL https://ieeexplore.ieee.org/document/10891469
DOI https://doi.org/10.1109/TMM.2025.3543043
Abstract Visual Question Answering (VQA) is a challenging task at the intersection of computer vision and natural language processing, aiming to bridge the semantic gap between visual perception and linguistic comprehension. Traditional VQA approaches do not distinguish between data processing and reasoning, which limits their interpretability and generalizability in complex and diverse scenarios. Conversely, Programmatic Visual Question Answering (PVQA) models leverage large language models (LLMs) to generate executable code, producing answers with detailed, interpretable reasoning processes. However, existing PVQA models typically rely on simplistic input-output prompting, which struggles to elicit domain-specific knowledge from LLMs and often produces unclear or extraneous outputs. Furthermore, their in-context example (ICE) selection is typically driven by individual word similarity rather than overall sentence context, leading to suboptimal ICE choices and a reliance on dataset-specific ICE candidates. In this paper, we propose ContextualCoder, a novel prompting framework tailored for PVQA models. ContextualCoder leverages frozen LLMs for code generation and pre-trained visual models for code execution, eliminating the need for extensive training and enhancing model flexibility. By incorporating an innovative prompting methodology and a novel ICE selection strategy, ContextualCoder makes diverse in-context information available for code generation, thereby improving the performance of PVQA models. Our approach surpasses state-of-the-art models, as evidenced by comprehensive experiments across diverse VQA datasets, including multilingual scenarios.
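
As a rough illustration of the pipeline the abstract describes (sentence-level ICE selection followed by prompting a frozen LLM to generate a program), the following minimal Python sketch ranks candidate in-context examples by whole-sentence embedding similarity rather than word overlap. This is not the paper's implementation: the embedding model (all-MiniLM-L6-v2 via the sentence-transformers library), the toy candidate pool, and the prompt format are all assumptions made for illustration.

    # Hypothetical sketch of sentence-level in-context example (ICE) selection
    # for programmatic VQA prompting. The candidate pool, embedding model, and
    # prompt format are illustrative assumptions, not the paper's implementation.
    from sentence_transformers import SentenceTransformer, util

    # Toy pool of (question, program) pairs; detect/crop/query_color inside the
    # program strings are hypothetical vision primitives, never executed here.
    ICE_POOL = [
        ("How many dogs are in the image?",
         "boxes = detect(image, 'dog')\nanswer = len(boxes)"),
        ("What color is the car on the left?",
         "car = detect(image, 'car')[0]\nanswer = query_color(crop(image, car))"),
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    def select_ices(question, k=1):
        # Rank candidates by cosine similarity of whole-sentence embeddings,
        # so selection reflects overall sentence context, not word overlap.
        q_emb = model.encode(question, convert_to_tensor=True)
        cand_embs = model.encode([q for q, _ in ICE_POOL], convert_to_tensor=True)
        scores = util.cos_sim(q_emb, cand_embs)[0]
        top = scores.topk(min(k, len(ICE_POOL))).indices.tolist()
        return [ICE_POOL[i] for i in top]

    def build_prompt(question):
        # Prepend the selected (question, program) pairs as in-context examples,
        # then leave the new question for a frozen code-generating LLM to complete.
        parts = []
        for q, prog in select_ices(question):
            parts.append("# Question: " + q + "\n" + prog + "\n")
        parts.append("# Question: " + question)
        return "\n".join(parts)

    print(build_prompt("How many cats are on the sofa?"))

The code-execution side, where the generated program calls pre-trained visual models to produce the answer, is omitted from this sketch.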
