This study combines a Gaussian mixture model support vector
machine (GMM-SVM) system with a nonlinear feature transformation
that is discriminatively trained to extract speaker-specific
features from MFCCs. A regularized siamese deep network (RSDN)
separates the speaker information in the speech signal from the
non-speaker-related information. The RSDN learns a hidden
representation that
characterizes speaker information by training a subset of the
hidden units using pairs of speech segments. MFCC features
are input to a trained RSDN, and a subset of the hidden-layer
outputs is used as the new input features in a GMM-SVM system. We
demonstrate the potential of this approach for text-independent
speaker verification by applying it to a subset of the NIST SRE
2006 1conv4w-1conv4w task. The hybrid RSDN GMM-SVM
system achieves a relative improvement of about 5% over the
baseline GMM-SVM system.
Index Terms: speaker verification, neural networks, feature extraction,
GMM-SVM
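
The feature-extraction step described in the abstract (MFCC frames passed through a trained network, with a subset of hidden-layer outputs kept as new features) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the layer sizes, the tanh activation, the random stand-in weights (a real system would use trained RSDN parameters), the function name extract_speaker_features, and the choice of the first n_speaker_units outputs as the speaker-specific subset.

```python
import numpy as np


def extract_speaker_features(mfcc, weights, biases, n_speaker_units):
    """Pass MFCC frames through an encoder and keep a subset of hidden units.

    mfcc: array of shape (n_frames, n_mfcc).
    weights, biases: per-layer parameters (hypothetical stand-ins for a
        trained network).
    n_speaker_units: number of top-layer units treated as speaker-specific
        (assumed here to be the first units of the layer).
    """
    h = np.asarray(mfcc, dtype=float)
    for W, b in zip(weights, biases):
        # Nonlinear transformation; tanh is an assumed activation.
        h = np.tanh(h @ W + b)
    # Keep only the subset of hidden outputs used as new features.
    return h[:, :n_speaker_units]


rng = np.random.default_rng(0)
layer_sizes = [39, 100, 100, 60]  # hypothetical: 39-dim MFCCs, 60 top units
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

mfcc = rng.normal(size=(200, 39))  # 200 synthetic frames, not real speech
features = extract_speaker_features(mfcc, weights, biases, n_speaker_units=30)
print(features.shape)  # (200, 30)
```

In a pipeline of this shape, the returned frame-level features would take the place of the raw MFCCs as input to the GMM-SVM back end.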