Despite recent progress in speech recognition, meeting speech
recognition remains a challenging task, since it is often difficult
to separate one speaker's voice from those of the other participants.
In this paper, we propose a joint training framework of speaker
separation and speech recognition for multi-channel recordings. The
location of each speaker is first estimated and then used to recover
her/his original speech with a delay-and-subtraction (DAS)
algorithm. The two components, speaker separation and speech
recognition, are represented by a single deep network, which is
optimized as a whole on the training data.
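To make the DAS step concrete, the following is a minimal NumPy sketch, not the paper's exact formulation, for a two-microphone, two-speaker case: if the interfering speaker's signal reaches microphone 2 some integer number of samples later than microphone 1 (a time difference of arrival derived from the estimated location), delaying channel 1 by that amount and subtracting cancels the interferer, while the target speaker, arriving with a different delay, survives. The function name and the integer-delay simplification are assumptions for illustration.

    import numpy as np

    def delay_and_subtract(x1, x2, interferer_delay):
        # x1, x2: equal-length signals from microphones 1 and 2.
        # interferer_delay: samples by which the interferer arrives
        # later at mic 2 than at mic 1 (integer, for simplicity).
        shifted = np.zeros_like(x1)
        if interferer_delay >= 0:
            shifted[interferer_delay:] = x1[:len(x1) - interferer_delay]
        else:
            shifted[:interferer_delay] = x1[-interferer_delay:]
        # The interferer components in `shifted` and `x2` now align
        # and cancel; the target, with a different delay, remains.
        return x2 - shifted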
We evaluated our method on simulated data generated from the
WSJCAM0 database. Compared with independent training of the two
components, the proposed method improved word accuracy by 15.2%
when the locations of the speakers are known, and by 53.6% when
they are unknown.
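As a sketch of how the two components can be represented by one deep network and optimized as a whole, the PyTorch-style fragment below composes a separation front-end with an ASR back-end so that a single recognition loss back-propagates through both stages; the class and module names are hypothetical, not the paper's actual architecture.

    import torch.nn as nn

    class JointSeparationASR(nn.Module):
        # Illustrative joint model: separation front-end feeding an
        # ASR back-end, trained end-to-end (names are assumptions).
        def __init__(self, separator: nn.Module, recognizer: nn.Module):
            super().__init__()
            self.separator = separator
            self.recognizer = recognizer

        def forward(self, multichannel_audio):
            enhanced = self.separator(multichannel_audio)  # per-speaker speech
            return self.recognizer(enhanced)               # ASR outputs

    # Both stages live in one computation graph, so a single ASR loss
    # trains the whole pipeline jointly, e.g.:
    #   loss = criterion(model(batch), targets)
    #   loss.backward(); optimizer.step()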