Tokyo Tech at TRECVID 2020: Relation Modeling for Video Action Detection

Ronaldo Prata Amorim; Nakamasa Inoue; Koichi Shinoda

Publication Information

Title

Japanese:	Tokyo Tech at TRECVID 2020: Relation Modeling for Video Action Detection
English:	Tokyo Tech at TRECVID 2020: Relation Modeling for Video Action Detection

Author

Japanese:	Prata Amorim Ronaldo, 井上中順, 篠田浩一.
English:	Ronaldo Prata Amorim, Nakamasa Inoue, Koichi Shinoda.

Language

English

Journal/Book name

Japanese:
English:	TRECVID 2020 Notebook Papers

Volume, Number, Page

Published date

Dec. 8, 2020

Publisher

Japanese:
English:	TRECVID

Conference name

Japanese:
English:	TREC Video Retrieval Evaluation (TRECVID) 2020

Conference site

Japanese:
English:

File

Official URL

https://www-nlpir.nist.gov/projects/tv2020/tv20.workshop.notebook/tv20.toc.html

Abstract

We propose an action detection system for detecting human and vehicle actions in long untrimmed videos, submitted for the TRECVID Activities in Extended Video (ActEV) 2020 challenge. It utilizes an object detection and tracking stage to divide the initial video into object tracks for all possible actors, followed by action localization to temporally localize and classify all actions within these tracks. Finally, we conduct several experiments into spatial and temporal relation modeling, both showing limited performance improvement, but demonstrating the possibility of similar approaches for future video action detection research. Besides the VIRAT dataset utilized for the challenge, we utilize networks pretrained on the ImageNet and ActivityNet datasets.

Summaries of the different submitted runs are as follows:

• 22342 - TTA-baseline: Standard two-stage system without any relation modeling
• 22442 - TTA-SRM: Same as baseline, but utilizing spatial relation modeling post-processing
• 22658 - TTA-SF2: System using multiple sampling rates for temporal action localization
• 22657 - TTA-SF: Same as SF2, but utilizing spatial relation modeling

From the run results, we can see that utilizing the multi-sampling rate action localization slightly improves performance, while the relation modeling decreases performance, contrary to our validation experiments. This seems to indicate that our relation modeling is still premature.
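As a rough illustration of what "multiple sampling rates" for temporal action localization could look like, the sketch below slices one object track into fixed-length clips at several frame strides, so that the same number of sampled frames covers short and long temporal extents. This is not the authors' implementation: the function name `multi_rate_clips`, the `Clip` record, the clip length of 16, the strides (1, 2, 4), and the non-overlapping windowing are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    """Hypothetical record for one candidate clip cut from an object track."""
    start: int         # first frame index covered (inclusive)
    end: int           # last frame index covered (inclusive)
    stride: int        # sampling rate: every `stride`-th frame is kept
    frames: List[int]  # the sampled frame indices


def multi_rate_clips(
    track_start: int,
    track_end: int,
    clip_len: int = 16,                  # assumed number of sampled frames per clip
    strides: Sequence[int] = (1, 2, 4),  # assumed set of sampling rates
) -> List[Clip]:
    """Slice one object track into fixed-length clips at several sampling rates."""
    clips: List[Clip] = []
    for stride in strides:
        # Number of video frames spanned by clip_len samples at this stride.
        span = (clip_len - 1) * stride + 1
        first = track_start
        # Non-overlapping windows (an assumption); stop when a full clip no longer fits.
        while first + span - 1 <= track_end:
            frames = list(range(first, first + span, stride))
            clips.append(Clip(first, first + span - 1, stride, frames))
            first += span
    return clips


if __name__ == "__main__":
    for clip in multi_rate_clips(0, 63):
        print(clip.stride, clip.start, clip.end)
```

Each clip would then be passed to an action classifier, and detections from the different strides merged; how the submitted system scores and merges them is not described in this abstract.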
