
Publication Information


Title
Japanese: 
English: COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark 
Authors
Japanese: 前田 航希, 橋本 敦史, 平澤 寅庄, 原島 純, Leszek Rybicki, 深澤 祐援, 牛久 祥孝.  
English: Koki Maeda, Atsushi Hashimoto, Tosho Hirasawa, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku.  
Language English 
Journal/Book title
Japanese: 
English: 
Volume, Issue, Pages        
Publication date October 2024 
Publisher
Japanese: 
English: 
Conference name
Japanese: 
English: The 18th European Conference on Computer Vision (ECCV 2024) 
Venue
Japanese: 
English: Milan 
Abstract Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.
