
Publication Information


Title
Japanese: 
English:Multimodal Speech Recognition Using Mouth Images from Depth Camera 
Author
Japanese: 安井 勇樹, 井上 中順, 岩野 公司, 篠田 浩一.  
English: Yuki Yasui, Nakamasa Inoue, Koji Iwano, Koichi Shinoda.  
Language English 
Journal/Book name
Japanese: 
English:Proc. APSIPA 
Volume, Number, Page pp. 1233-1236
Published date Dec. 11, 2017 
Publisher
Japanese: 
English: 
Conference name
Japanese: 
English:APSIPA ASC 2017 
Conference site
Japanese: 
English:No. 5 Jalan Stesen Sentral, Kuala Lumpur. 
File
Official URL http://apsipa2017.org/
 
DOI https://doi.org/10.1109/APSIPA.2017.8282227
Abstract Deep learning has proven effective in multimodal speech recognition using frontal face images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which takes as input not only audio signals and face images but also depth images of faces. We collected continuous speech data from 20 speakers with Kinect 2.0 and used it for our evaluation. At an SNR of 10 dB, our method reduced the error rate from 34.6% to 24.2%, a 30% relative reduction compared with audio-only speech recognition. It is particularly effective for recognizing some consonants, including /k/ and /t/.
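The 30% figure in the abstract is a relative reduction, not a difference in percentage points. The short Python sketch below reproduces that arithmetic from the two error rates quoted in the abstract. The function name `relative_reduction` is a hypothetical helper introduced here for illustration, and the abstract does not state which error metric was used, so the values are treated simply as percentages.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from baseline to improved as a fraction of baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Error rates at 10 dB SNR, as quoted in the abstract (in percent).
audio_only_error = 34.6
trimodal_error = 24.2

# Absolute reduction is measured in percentage points.
absolute = audio_only_error - trimodal_error
# Relative reduction is measured against the audio-only baseline.
relative = relative_reduction(audio_only_error, trimodal_error)

print(f"absolute reduction: {absolute:.1f} points")
print(f"relative reduction: {relative:.1%}")
```

The absolute reduction is 10.4 percentage points, which corresponds to roughly 30% of the audio-only baseline.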

©2007 Tokyo Institute of Technology All rights reserved.