Speech separation is the task of recovering each individual speaker's voice from a mixture of multiple concurrent speakers. Speech separation techniques have been developed for both monaural and multichannel processing. Multichannel separation exploits both the spectral and the spatial information of the speech sources, which helps improve separation performance.
Deep learning-based speech separation has been studied extensively. Permutation Invariant Training (PIT) is commonly used to train separation models: it minimizes the separation loss over all possible output-target permutations. However, this search becomes costly as the number of speakers grows, since the number of permutations increases factorially. A previous work, Location-Based Training (LBT), uses the direction of arrival (DOA) of the speakers to support separation model training. It resolves the permutation problem by ordering the target speech signals by their DOAs for the loss calculation and performs better than PIT. However, LBT does not account for the cyclic nature of DOA, which can cause confusion when assigning separation outputs: a source located between 0 and 90 degrees is treated as distant from one located between 270 and 360 degrees, even though the two may be spatially close.
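To make the contrast between the two training schemes concrete, the sketch below shows PIT's exhaustive permutation search next to LBT's DOA-based ordering. It is a minimal NumPy illustration under our own assumptions, not the implementation from either work; the SI-SNR objective and the function names (si_snr_loss, pit_loss, lbt_loss) are illustrative choices. The comment in lbt_loss points at the wrap-around problem described above.

```python
import itertools
import numpy as np

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between one estimated and one reference signal."""
    ref_energy = np.sum(ref ** 2) + eps
    proj = (np.sum(est * ref) / ref_energy) * ref   # projection of the estimate onto the reference
    noise = est - proj
    si_snr = 10 * np.log10((np.sum(proj ** 2) + eps) / (np.sum(noise ** 2) + eps))
    return -si_snr

def pit_loss(estimates, targets):
    """Permutation Invariant Training: evaluate every output-target assignment
    and keep the cheapest one. The number of assignments grows factorially
    with the number of speakers."""
    n = len(targets)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = np.mean([si_snr_loss(estimates[i], targets[p]) for i, p in enumerate(perm)])
        best = min(best, loss)
    return best

def lbt_loss(estimates, targets, target_doas_deg):
    """Location-Based Training: fix the output order by sorting the targets on
    their azimuth DOA, so no permutation search is needed. Flaw illustrated in
    the text: sources at 5 deg and 355 deg land at opposite ends of the sorted
    order even though they are only 10 deg apart on the circle."""
    order = np.argsort(target_doas_deg)
    return np.mean([si_snr_loss(estimates[i], targets[j]) for i, j in enumerate(order)])
```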
Our work explores the use of the sound sources' DOAs to improve speaker separation. To address the aforementioned problems, we employ multitask learning of speaker separation and DOA estimation. The DOA of each speaker is used explicitly in the multitask loss calculation as supervision, in addition to the target speech.
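A minimal sketch of such a multitask objective is given below, reusing si_snr_loss from the previous snippet. How the permutation is resolved and how the two terms are weighted are not specified in this section, so the DOA-ordered assignment, the squared angular error, and the weight alpha are all assumptions made for illustration; the angular error is computed on the circle so that, for example, 355 and 5 degrees differ by 10 degrees.

```python
def multitask_loss(est_signals, est_doas_deg, target_signals, target_doas_deg, alpha=0.1):
    """Illustrative multitask objective: a separation term plus a DOA-estimation
    term over the same (DOA-ordered) speaker assignment. `alpha` is a
    hypothetical weight balancing the two terms."""
    order = np.argsort(target_doas_deg)
    # Separation term: negative SI-SNR against the DOA-ordered targets.
    sep_term = np.mean([si_snr_loss(est_signals[i], target_signals[j])
                        for i, j in enumerate(order)])
    # DOA term: squared angular error with wrap-around-aware distance on the circle.
    ang_err = np.abs(np.asarray(est_doas_deg) - np.asarray(target_doas_deg)[order])
    ang_err = np.minimum(ang_err, 360.0 - ang_err)
    doa_term = np.mean(ang_err ** 2)
    return sep_term + alpha * doa_term
```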