Thousands of videos are constantly being uploaded to the web, creating
a vast resource and an ever-growing demand for methods to make them
easier to index, search, and retrieve. While visual information is a
very important part of a video, acoustic and speech information often
complements it. State-of-the-art "content-based video retrieval"
(CBVR) research faces several challenges: how to robustly and
efficiently process large amounts of data, how to train classifiers
and segmenters on unlabeled data, how to represent and then fuse
information across modalities, how to include human feedback, etc.
Thanks to the advancement of computation technology, many of the
statistical approaches we originally developed for speech processing
can now be readily used for CBVR. This tutorial aims to present to the
speech community the state of the art in video processing by
discussing the most relevant tasks in NIST's TREC Video Retrieval
Evaluation (TRECVID) workshop series (http://trecvid.nist.gov/). We
liken TRECVID's "Semantic Indexing" (SIN) task, in which a system must
identify occurrences of concepts such as "desk" or "dancing" in a
video, to the word-spotting approach in speech.
We then explain more recent and challenging tasks, such as
"Multimedia Event Detection" (MED) and "Multimedia Event Recounting"
(MER), which can be compared to the meeting transcription and
summarization tasks in the speech area. Finally, we lay out how the
speech and language community, given its own vast body of experience,
can contribute to this work, and identify opportunities for advancing
speech-centric research on these datasets, whose large scale and
multi-modal nature pose unique challenges for future research.