Deep learning has proved effective for multimodal speech recognition using frontal face images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which takes as inputs not only audio signals and face images but also depth images of faces. We collected continuous speech data from 20 speakers with Kinect 2.0 and used it for our evaluation. Experimental results at 10 dB SNR showed that our method reduced the error rate by 30% relative to audio-only speech recognition, from 34.6% to 24.2%. It is particularly effective for recognizing certain consonants, including /k/ and /t/.
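
The abstract does not specify the architecture in detail; the following is only a minimal sketch of what a trimodal deep autoencoder could look like, assuming PyTorch, fully connected encoders, and hypothetical input sizes (39-dim MFCC audio frames, 32x32 face crops, 32x32 depth crops). Layer sizes, features, and the fusion scheme are assumptions, not the paper's actual design.

import torch
import torch.nn as nn

class TrimodalDeepAutoencoder(nn.Module):
    def __init__(self, audio_dim=39, image_dim=32 * 32, depth_dim=32 * 32,
                 hidden_dim=256, shared_dim=128):
        super().__init__()
        # One encoder per modality: audio, face image, depth image.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.image_enc = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.depth_enc = nn.Sequential(nn.Linear(depth_dim, hidden_dim), nn.ReLU())
        # Shared bottleneck fuses the three modality codes.
        self.shared = nn.Sequential(nn.Linear(3 * hidden_dim, shared_dim), nn.ReLU())
        # One decoder per modality reconstructs its input from the shared code.
        self.audio_dec = nn.Linear(shared_dim, audio_dim)
        self.image_dec = nn.Linear(shared_dim, image_dim)
        self.depth_dec = nn.Linear(shared_dim, depth_dim)

    def forward(self, audio, image, depth):
        h = torch.cat([self.audio_enc(audio),
                       self.image_enc(image),
                       self.depth_enc(depth)], dim=-1)
        z = self.shared(h)  # fused trimodal representation
        return self.audio_dec(z), self.image_dec(z), self.depth_dec(z), z

if __name__ == "__main__":
    model = TrimodalDeepAutoencoder()
    audio = torch.randn(8, 39)        # batch of MFCC frames (assumed features)
    image = torch.randn(8, 32 * 32)   # flattened face crops
    depth = torch.randn(8, 32 * 32)   # flattened depth crops
    a_hat, i_hat, d_hat, z = model(audio, image, depth)
    # Reconstruction loss over all three modalities; the shared code z
    # would feed a downstream speech recognizer.
    loss = sum(nn.functional.mse_loss(x_hat, x)
               for x_hat, x in [(a_hat, audio), (i_hat, image), (d_hat, depth)])
    print(loss.item(), z.shape)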