Multimodal Speech Recognition Using Mouth Images from Depth Camera

Yuki Yasui; Nakamasa Inoue; Koji Iwano; Koichi Shinoda

doi:https://doi.org/10.1109/APSIPA.2017.8282227

Publication Information

Title

Japanese:
English:	Multimodal Speech Recognition Using Mouth Images from Depth Camera

Author

Japanese:	安井勇樹, 井上中順, 岩野公司, 篠田浩一.
English:	Yuki Yasui, Nakamasa Inoue, Koji Iwano, Koichi Shinoda.

Language

English

Journal/Book name

Japanese:
English:	Proc. APSIPA

Volume, Number, Page

pp. 1233-1236

Published date

Dec. 11, 2017

Publisher

Japanese:
English:

Conference name

Japanese:
English:	APSIPA ASC 2017

Conference site

Japanese:
English:	No. 5 Jalan Stesen Sentral, Kuala Lumpur.

File

Official URL

http://apsipa2017.org/

DOI

https://doi.org/10.1109/APSIPA.2017.8282227

Abstract

Deep learning has proven effective in multimodal speech recognition using frontal face images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which takes as input not only audio signals and face images but also depth images of faces. We collected continuous speech data from 20 speakers with Kinect 2.0 and used it for our evaluation. In experiments at 10 dB SNR, our method reduced the error rate from 34.6% for audio-only speech recognition to 24.2%, a relative reduction of 30%. It is particularly effective for recognizing some consonants, including /k/ and /t/.
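
The 30% figure in the abstract is a relative reduction, not a difference in percentage points. The short sketch below reproduces that arithmetic from the two error rates reported at 10 dB SNR. The function and variable names are illustrative only and do not come from the paper.

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction in error rate, as a percentage."""
    return (baseline - improved) / baseline * 100.0


# Error rates (%) quoted in the abstract for the 10 dB SNR condition.
audio_only = 34.6  # audio-only speech recognition
trimodal = 24.2    # proposed trimodal method (audio + face image + depth image)

reduction = relative_error_reduction(audio_only, trimodal)
print(f"{reduction:.1f}% relative reduction")  # prints "30.1% relative reduction"
```

The absolute improvement is 10.4 percentage points; dividing by the audio-only baseline of 34.6% gives the roughly 30% relative reduction stated in the abstract.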
