SepVAC: Multitask Learning of Speaker Separation, Speaker Localization, Microphone Array Localization, and Room Acoustic Parameter Estimation in Various Acoustic Conditions
This paper proposes a multitask learning method for speech separation that jointly Separates speech and estimates recording conditions in Various Acoustic Conditions (SepVAC). Unlike previous methods that aim to achieve robustness against the uncertainty caused by noise and reverberation, the proposed method explicitly estimates speaker and microphone locations and room acoustic parameters to disambiguate them from speech features. We introduce curriculum learning to train the model parameters stably. In an evaluation on the SMS-WSJ-Plus dataset, SepVAC outperforms the state-of-the-art SpatialNet baseline by 0.67 points in word error rate (WER).