
Publication Information


Title
Japanese: 
English: Multitask Learning of Speaker Separation and Direction-of-Arrival Estimation
Author
Japanese: Hartanto Roland, Sakriani Sakti, 篠田 浩一.  
English: Roland Hartanto, Sakriani Sakti, Koichi Shinoda.  
Language English 
Journal/Book name
Japanese: 日本音響学会第151回(2024年春季)研究発表会 講演論文集
English: Proceedings of the 151st Meeting (2024 Spring) of the Acoustical Society of Japan
Volume, Number, Page pp. 69-70
Published date Mar. 2024 
Publisher
Japanese: 一般社団法人日本音響学会
English: Acoustical Society of Japan
Conference name
Japanese: 日本音響学会第151回(2024年春季)研究発表会
English: The 151st Meeting (2024 Spring) of the Acoustical Society of Japan
Conference site
Japanese: 東京都文京区
English: Bunkyo-ku, Tokyo
File
Official URL https://acoustics.jp/cms/wp_asj/wp-content/uploads/004_2024spring_program.pdf
 
Abstract Speech separation is the process of extracting individual speakers' voices from a mixture of multiple speakers. Speech separation techniques have been developed for both monaural and multichannel speech processing. Multichannel separation exploits both the spectral and the spatial information of the speech sources, which helps improve separation performance. Deep learning-based speech separation has been studied extensively. Permutation Invariant Training (PIT) is commonly used to train separation models: it minimizes the separation loss over all possible output-target permutations. However, this becomes computationally costly as the number of speakers increases, since the number of permutations grows factorially. A previous approach called Location-Based Training (LBT) utilizes the speakers' directions of arrival (DOAs) to support separation model training. It avoids the permutation search by ordering the target speech signals by their DOAs for the loss calculation, and it outperforms PIT. However, LBT does not account for the cyclic nature of DOA, which can cause confusion when assigning separation outputs: a source located between 0 and 90 degrees is treated as distant from one located between 270 and 360 degrees, even though the two may be spatially close. Our work explores the use of the sound sources' DOAs to improve speaker separation. To address these problems, we employ multitask learning of speaker separation and DOA estimation, in which each speaker's DOA is used explicitly in the multitask loss as supervision in addition to the target speech.
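
The abstract contrasts three training objectives: PIT's exhaustive permutation search, LBT's DOA-ordered target assignment, and the paper's multitask loss with explicit DOA supervision. The following PyTorch sketch illustrates these ideas in minimal form; it is not the authors' implementation, and the tensor shapes, the mean-squared-error loss, and the alpha weight are illustrative assumptions.

import itertools
import torch

def mse(est, ref):
    # Per-speaker mean-squared error over time; real systems often use
    # scale-invariant SDR instead (MSE is an assumption here, for brevity).
    return torch.mean((est - ref) ** 2, dim=-1)

def pit_loss(est, ref):
    # Permutation Invariant Training: evaluate the loss under all S!
    # output-target permutations and keep the cheapest one. This S!
    # search is what makes PIT costly as the speaker count grows.
    n_spk = est.shape[0]
    losses = [
        torch.stack([mse(est[i], ref[j]) for i, j in enumerate(perm)]).mean()
        for perm in itertools.permutations(range(n_spk))
    ]
    return torch.stack(losses).min()

def lbt_loss(est, ref, doa_deg):
    # Location-Based Training (sketch): fix the assignment by sorting
    # the targets by their DOA, so no permutation search is needed.
    order = torch.argsort(doa_deg)
    return torch.stack(
        [mse(est[i], ref[j]) for i, j in enumerate(order)]
    ).mean()

def multitask_loss(est, ref, doa_est, doa_ref, alpha=0.5):
    # Multitask objective in the spirit of the paper: separation loss
    # plus a DOA-estimation loss. Comparing angles through the cosine of
    # their difference respects the 0/360-degree wrap-around that the
    # abstract says plain LBT ignores. `alpha` is a hypothetical weight.
    sep = lbt_loss(est, ref, doa_ref)
    diff = torch.deg2rad(doa_est - doa_ref)
    doa = torch.mean(1.0 - torch.cos(diff))
    return sep + alpha * doa

# Example: two speakers whose DOAs (10 and 350 degrees) are numerically
# far apart but spatially adjacent; the circular DOA loss stays small.
est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
doa_est = torch.tensor([12.0, 348.0])
doa_ref = torch.tensor([10.0, 350.0])
print(pit_loss(est, ref), multitask_loss(est, ref, doa_est, doa_ref))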
