Consumer video now accounts for a large share of Internet traffic, but most of it lacks the text tags needed for search. Hence, video semantic indexing, which extracts visual concepts such as objects, scenes, and actions directly from video content, has been studied intensively. Fundamentally, this task consists of two problems: localization and recognition. Although these two problems were until recently studied independently, emerging end-to-end deep learning techniques based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) offer effective ways to solve them simultaneously. These techniques are closely related to spoken-word detection techniques in the speech field. In this talk, we review recent progress in this area and discuss potential directions for future research.