Publication Information


Title
Japanese: 
English:Multitask Training of Multi-channel Speaker Separation and Room Acoustic Parameter Estimation 
Author
Japanese: Hartanto Roland, Sakriani Sakti, 篠田浩一.
English: Roland Hartanto, Sakriani Sakti, Koichi Shinoda.  
Language English 
Journal/Book name
Japanese:日本音響学会第153回(2025年春季)研究発表会_講演論文集 
English: Proceedings of the 153rd Meeting of the Acoustical Society of Japan (Spring 2025)
Volume, Number, Page pp. 233-234
Published date Mar. 3, 2025 
Publisher
Japanese:一般社団法人 日本音響学会 
English: Acoustical Society of Japan
Conference name
Japanese:日本音響学会第153回(2025年春季)研究発表会 
English: The 153rd Meeting of the Acoustical Society of Japan (Spring 2025)
Conference site
Japanese:埼玉県 
English: Saitama, Japan
Official URL https://acoustics.jp/annualmeeting/program/
 
Abstract Speaker separation extracts individual speech signals from a speech mixture. It is used in single- and multi-channel front-end speech processing to handle overlapping speech. Multi-channel separation leverages both the spectral and spatial information of speakers, improving separation quality. Deep learning methods for multi-channel speech separation have been widely explored. Permutation Invariant Training (PIT) trains a separation model by minimizing the separation loss over all possible permutations of output-target pairs. Other studies show that location information can further improve separation. For example, Location-Based Training (LBT) uses the direction of arrival (DoA) of the speakers to order the target speech for loss computation, and it outperforms PIT. MSDET performs multitask learning of speaker separation and DoA estimation, further improving separation. However, speaker locations alone are insufficient to handle diverse acoustic conditions. In real environments, many parameters affect the acoustics, such as room size, wall surface materials, microphone array placement, and speaker locations. This work proposes jointly learning speaker separation with room acoustic parameter estimation, speaker localization, and microphone array localization, exploiting room acoustic information to improve separation across diverse acoustic conditions. Separation models learn room acoustics only implicitly; multitask learning provides explicit supervision for room acoustic parameters, which improves separation. Our method separates speech using room acoustic features that capture reverberation information, and better separation in turn improves the estimation of the room acoustic parameters.
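The abstract contrasts two ways of pairing model outputs with reference speakers for the loss: PIT searches all output-target permutations and keeps the minimum, while LBT fixes the pairing by ordering references by their DoA. A minimal sketch of both ideas, using a plain MSE loss on NumPy arrays (function names, the MSE choice, and azimuth-sorted ordering are illustrative assumptions, not the paper's actual implementation):

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, targets):
    """Permutation Invariant Training (PIT), illustrative MSE version:
    evaluate the loss under every assignment of model outputs to
    reference speakers and keep the minimum.
    estimates, targets: arrays of shape (n_speakers, n_samples)."""
    n = estimates.shape[0]
    best, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best:
            best, best_perm = loss, perm
    return best, best_perm

def lbt_mse_loss(estimates, targets, doas):
    """Location-Based Training (LBT) style loss: instead of searching
    permutations, fix the pairing by ordering the reference speakers
    by their direction of arrival (here, ascending azimuth in degrees)."""
    order = np.argsort(doas)
    return np.mean((estimates - targets[order]) ** 2)
```

PIT's search grows factorially with the number of speakers, whereas the LBT-style fixed ordering is constant-cost, which is one reason location cues are attractive for loss assignment.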

©2007 Institute of Science Tokyo All rights reserved.