Depression is a complex mental disorder that has been widely studied. Using various machine learning
techniques, researchers have been able to predict whether an individual is healthy or experiencing depression. The
most common approach involves analyzing a person’s voice and speech content. Multimodal approaches improve
prediction accuracy by incorporating facial features. With the rising popularity of large language models (LLMs),
these models have recently been applied to evaluate symptoms of depression. In this paper, we explore the use of
LLMs combined with audio and facial features for binary classification of depression using a Japanese dataset. The
proposed method achieved an average accuracy of 0.7956 under DSM-5-based labeling with 5-fold cross-validation.
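The evaluation protocol (binary classification scored by average accuracy over 5 folds) can be sketched as below. This is a minimal illustration only: the logistic-regression classifier and the randomly generated features are placeholders, not the paper's actual LLM, audio, or facial pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for fused multimodal features (e.g. audio + facial +
# LLM-derived text features); the real feature extraction is not shown here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)  # binary labels

# 5-fold cross-validation; the reported metric is the mean fold accuracy.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean 5-fold accuracy: {scores.mean():.4f}")
```

The same averaging over folds is how a single headline accuracy figure such as 0.7956 is typically computed.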