Publication Information


Title
Japanese: 
English: Robust Video Information Retrieval using Speech Technologies
Author
Japanese: 篠田 浩一.  
English: Koichi Shinoda.  
Language English 
Journal/Book name
Japanese: 
English: 
Volume, Number, Page        
Published date June 20, 2014 
Publisher
Japanese: 
English: 
Conference name
Japanese: 
English: Language Technologies Institute, Carnegie Mellon University
Conference site
Japanese: 5000 Forbes Ave, Pittsburgh, PA 15213
English: 5000 Forbes Ave, Pittsburgh, PA 15213
Abstract
Lecture 1: Speaker adaptation techniques for speech recognition
Speaker adaptation techniques were studied extensively in the early 1990s and remain among the essential techniques in automatic speech recognition. They are a form of transfer learning, in which the parameters of a speaker-independent model are modified to fit the acoustic characteristics of an individual speaker, using only a small number of that speaker's utterances. These techniques have been applied successfully not only to differences between speakers but also to differences in channels, noise environments, and so on. In this lecture, we first explain the fundamental speaker adaptation techniques: Maximum A Posteriori (MAP) estimation, Maximum Likelihood Linear Regression (MLLR), and Eigenvoice. We then explain how they are combined with one another, with other training techniques such as discriminative learning, and with structural learning, as in Structural MAP (SMAP) or SMAPLR. We also discuss how these techniques are applied to robust speech recognition in noisy environments.

Lecture 2: Robust video information retrieval using speech technologies
The amount of video data on the Internet has been increasing rapidly. These videos vary widely and are in most cases of low quality, so robust techniques for video indexing are in strong demand. In automatic video semantic indexing, a user submits a text query for a desired object or scene to a search system, which returns video shots that include that object or scene. Many techniques developed in speech research have been employed successfully in this application. For example, a method using Gaussian mixture model (GMM) supervectors and support vector machines (SVMs) was recently shown to be very effective. In this method, speech technologies such as speaker verification and speaker adaptation play very important roles. In this lecture, we first introduce the activities of the NIST TRECVID workshop, which showcases state-of-the-art video search technologies. We then discuss several techniques for achieving robustness to the variety of Internet video, including SIFT and HOG features, Bag of Visual Words, the Fisher kernel, a multi-modal framework, and fast tree search.