Multimodal Speech Recognition Using Mouth Images from Depth Camera

Yuki Yasui; Nakamasa Inoue; Koji Iwano; Koichi Shinoda

doi:https://doi.org/10.1109/APSIPA.2017.8282227

Publication Information

Title

Japanese:
English:	Multimodal Speech Recognition Using Mouth Images from Depth Camera

Authors

Japanese:	安井勇樹, 井上中順, 岩野公司, 篠田浩一.
English:	Yuki Yasui, Nakamasa Inoue, Koji Iwano, Koichi Shinoda.

Language

English

Journal/Book Title

Japanese:
English:	Proc. APSIPA

Volume, Issue, Pages

pp. 1233-1236

Publication Date

December 11, 2017

Publisher

Japanese:
English:

Conference Name

Japanese:
English:	APSIPA ASC 2017

Venue

Japanese:
English:	No. 5 Jalan Stesen Sentral, Kuala Lumpur.

Files

Official Link

http://apsipa2017.org/

DOI

https://doi.org/10.1109/APSIPA.2017.8282227

Abstract

Deep learning has proven effective in multimodal speech recognition using frontal face images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which takes as inputs not only audio signals and face images but also depth images of faces. We collected continuous speech data from 20 speakers with Kinect 2.0 and used it for our evaluation. In experiments at an SNR of 10 dB, our method reduced the error rate by 30%, from 34.6% with audio-only speech recognition to 24.2%. It is particularly effective for recognizing some consonants, including /k/ and /t/.
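
The 30% figure reads as a relative reduction, which can be checked from the two error rates quoted in the abstract. The following is a minimal sketch in Python; the function name `relative_reduction` and the variable names are illustrative and do not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Error rates (%) quoted in the abstract at 10 dB SNR.
audio_only = 34.6
trimodal = 24.2

reduction = relative_reduction(audio_only, trimodal)
print(f"{reduction:.1%}")  # prints 30.1%
```

The absolute difference is 10.4 percentage points, so the 30% stated in the abstract corresponds to the relative change against the audio-only baseline.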
