Speech separation is the task of recovering each individual speaker's voice from a mixture of multiple concurrent speakers. Speech separation techniques have been developed for both monaural and multichannel processing. Multichannel separation exploits both the spectral and the spatial information of the speech sources, which helps improve separation performance.
Deep learning-based speech separation has been studied extensively. Permutation Invariant Training (PIT) is commonly used to train separation models: it minimizes the separation loss over all possible output-target permutations. However, this search becomes costly as the number of speakers grows, since the number of permutations increases factorially. A previous work, Location-Based Training (LBT), uses the direction of arrival (DOA) of the speakers to support separation model training. It resolves the permutation problem by ordering the target speech signals by their DOAs for the loss calculation and performs better than PIT. However, LBT does not account for the cyclic nature of DOA, which can cause confusion when assigning separation outputs: a source located between 0 and 90 degrees is treated as distant from one located between 270 and 360 degrees, even though the two may be spatially close.
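To make the contrast between the two training schemes concrete, the sketch below shows PIT's exhaustive permutation search next to LBT's DOA-based ordering. It is a minimal NumPy illustration under our own assumptions, not the implementation from either work; the SI-SNR objective and the function names (si_snr_loss, pit_loss, lbt_loss) are illustrative choices. The comment in lbt_loss points at the wrap-around problem described above.

```python
import itertools
import numpy as np

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between one estimated and one reference signal."""
    ref_energy = np.sum(ref ** 2) + eps
    proj = (np.sum(est * ref) / ref_energy) * ref   # projection of the estimate onto the reference
    noise = est - proj
    si_snr = 10 * np.log10((np.sum(proj ** 2) + eps) / (np.sum(noise ** 2) + eps))
    return -si_snr

def pit_loss(estimates, targets):
    """Permutation Invariant Training: evaluate every output-target assignment
    and keep the cheapest one. The number of assignments grows factorially
    with the number of speakers."""
    n = len(targets)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = np.mean([si_snr_loss(estimates[i], targets[p]) for i, p in enumerate(perm)])
        best = min(best, loss)
    return best

def lbt_loss(estimates, targets, target_doas_deg):
    """Location-Based Training: fix the output order by sorting the targets on
    their azimuth DOA, so no permutation search is needed. Flaw illustrated in
    the text: sources at 5 deg and 355 deg land at opposite ends of the sorted
    order even though they are only 10 deg apart on the circle."""
    order = np.argsort(target_doas_deg)
    return np.mean([si_snr_loss(estimates[i], targets[j]) for i, j in enumerate(order)])
```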
Our work explores the use of the sound sources' DOAs to improve speaker separation. To address the aforementioned problems, we employ multitask learning of speaker separation and DOA estimation. The DOA of each speaker is used explicitly in the multitask loss calculation as supervision, in addition to the target speech.
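A minimal sketch of such a multitask objective is given below, reusing si_snr_loss from the previous snippet. How the permutation is resolved and how the two terms are weighted are not specified in this section, so the DOA-ordered assignment, the squared angular error, and the weight alpha are all assumptions made for illustration; the angular error is computed on the circle so that, for example, 355 and 5 degrees differ by 10 degrees.

```python
def multitask_loss(est_signals, est_doas_deg, target_signals, target_doas_deg, alpha=0.1):
    """Illustrative multitask objective: a separation term plus a DOA-estimation
    term over the same (DOA-ordered) speaker assignment. `alpha` is a
    hypothetical weight balancing the two terms."""
    order = np.argsort(target_doas_deg)
    # Separation term: negative SI-SNR against the DOA-ordered targets.
    sep_term = np.mean([si_snr_loss(est_signals[i], target_signals[j])
                        for i, j in enumerate(order)])
    # DOA term: squared angular error with wrap-around-aware distance on the circle.
    ang_err = np.abs(np.asarray(est_doas_deg) - np.asarray(target_doas_deg)[order])
    ang_err = np.minimum(ang_err, 360.0 - ang_err)
    doa_term = np.mean(ang_err ** 2)
    return sep_term + alpha * doa_term
```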