Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

Mariana Rodrigues Makiuchi; Tifani Warnita; Kuniaki Uto; Koichi Shinoda

doi:10.1145/3347320.3357694

Publication Information

Title

Japanese:	Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection
English:	Multimodal Fusion of BERT-CNN and Gated CNN Representations for Depression Detection

Authors

Japanese:	R Makiuchi Mariana, Warnita Tifani, 宇都有昭, 篠田浩一.
English:	Mariana Rodrigues Makiuchi, Tifani Warnita, Kuniaki Uto, Koichi Shinoda.

Language

English

Journal/Book Title

Japanese:	Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop
English:	Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop

Volume, Issue, Pages

Pages 55-63

Publication Date

October 2019

Publisher

Japanese:
English:	Association for Computing Machinery

Conference Name

Japanese:
English:	9th International Audio/Visual Emotion Challenge and Workshop (AVEC) 2019

Conference Location

Japanese:	ニース
English:	Nice

Files

Official Link

https://dl.acm.org/citation.cfm?id=3357694

DOI

https://doi.org/10.1145/3347320.3357694

Abstract

Depression is a common but serious mental disorder that affects people all over the world. Besides providing an easier way of diagnosing the disorder, a computer-aided automatic depression assessment system is needed to reduce subjective bias in the diagnosis. We propose a multimodal fusion of speech and linguistic representations for depression detection. We train our model to infer the Patient Health Questionnaire (PHQ) score of subjects from the AVEC 2019 DDS Challenge database, the E-DAIC corpus. For the speech modality, we use deep spectrum features extracted from a pretrained VGG-16 network and employ a Gated Convolutional Neural Network (GCNN) followed by an LSTM layer. For the textual embeddings, we extract BERT textual features and employ a Convolutional Neural Network (CNN) followed by an LSTM layer. We achieved CCC scores of 0.497 and 0.608 on the E-DAIC corpus development set using the unimodal speech and linguistic models, respectively. We further combine the two modalities using a feature fusion approach in which we feed the last representation of each single-modality model to a fully-connected layer in order to estimate the PHQ score. With this multimodal approach, we achieved a CCC score of 0.696 on the development set and 0.403 on the testing set of the E-DAIC corpus, an absolute improvement of 0.283 points over the challenge baseline.
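
The abstract reports all results as CCC scores without defining the metric. CCC is the concordance correlation coefficient, which rewards predictions that are both correlated with the reference scores and free of systematic offset. The following is a minimal sketch of the standard formula, not the challenge's evaluation script: the function name and the example scores are hypothetical, and the use of population (divide-by-n) variance is an assumption.

```python
from statistics import fmean


def concordance_correlation_coefficient(y_true, y_pred):
    """Standard CCC: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2).

    Assumption: population (divide-by-n) variance and covariance are used.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have the same length (at least 2)")

    n = len(y_true)
    mean_true = fmean(y_true)
    mean_pred = fmean(y_pred)

    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred)
              for t, p in zip(y_true, y_pred)) / n

    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        # Both sequences are constant and equal; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denominator


# Hypothetical PHQ-style scores, for illustration only.
reference = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated, but offset by one point
print(concordance_correlation_coefficient(reference, reference))
print(concordance_correlation_coefficient(reference, shifted))
```

Unlike Pearson correlation, which would be 1 for both calls, the second call returns a value below 1 because the constant offset enters the denominator through the squared difference of means.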
