Deep neural networks (DNNs) used for acoustic modeling in speech recognition often have a very large
number of output units corresponding to context-dependent (CD) triphone HMM states. The amount of
data available for speaker adaptation is often limited, so a large majority of these CD states may
not be observed during adaptation. In this case, the posterior probabilities of unseen CD states are
only pushed towards zero during DNN speaker adaptation, and the ability to predict these states can
be degraded relative to the speaker-independent network. We address this problem by appending an
additional output layer that maps the original set of DNN output classes to a smaller set of
phonetic classes (e.g., monophones), thereby reducing the number of unseen states in the
adaptation data. Adaptation proceeds by backpropagation of errors from the new output layer, which
is disregarded at recognition time when posterior probabilities over the original set of CD states
are used. We demonstrate the benefits of this approach over adapting the network with the original
set of CD states using experiments on a Japanese voice search task and obtain a 5.03% relative
reduction in character error rate with approximately 60 seconds of adaptation data.
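To make the scheme concrete, the following is a minimal sketch of the simplest variant: the appended layer is a fixed 0/1 matrix that sums CD-state posteriors into their parent monophone classes, and adaptation backpropagates the monophone-level error through the network. This is an illustration under stated assumptions, not the paper's implementation; the layer sizes, the fixed summation mapping, and all identifiers (dnn, cd_to_mono, adapt) are hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sizes: NUM_CD_STATES CD-state outputs collapsed onto NUM_MONO monophone classes.
NUM_CD_STATES, NUM_MONO, FEAT_DIM = 9000, 40, 440

# Assume `dnn` is the speaker-independent acoustic model, ending in logits over CD states.
dnn = nn.Sequential(
    nn.Linear(FEAT_DIM, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, NUM_CD_STATES),
)

# cd_to_mono[i] = monophone class of CD state i (taken from the state tying);
# random here purely for illustration.
cd_to_mono = torch.randint(0, NUM_MONO, (NUM_CD_STATES,))

# Fixed 0/1 mapping matrix: a monophone posterior is the sum of its CD-state posteriors.
mapping = torch.zeros(NUM_CD_STATES, NUM_MONO)
mapping[torch.arange(NUM_CD_STATES), cd_to_mono] = 1.0

def adapt(model, frames, mono_targets, epochs=3, lr=1e-3):
    """Adapt `model` on a small amount of data labeled with monophone targets."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        cd_post = torch.softmax(model(frames), dim=-1)   # posteriors over CD states
        mono_post = cd_post @ mapping                    # collapse to monophone posteriors
        loss = nn.functional.nll_loss(torch.log(mono_post + 1e-8), mono_targets)
        opt.zero_grad()
        loss.backward()                                  # errors backpropagate through the DNN
        opt.step()

# Example adaptation call with roughly 60 seconds of (synthetic) frame-level data:
frames = torch.randn(6000, FEAT_DIM)                     # e.g. 6000 frames at 10 ms per frame
mono_targets = torch.randint(0, NUM_MONO, (6000,))
adapt(dnn, frames, mono_targets)

# At recognition time the mapping layer is discarded and the adapted network's
# CD-state posteriors are used as before:
with torch.no_grad():
    cd_posteriors = torch.softmax(dnn(frames), dim=-1)
```

Because the mapping layer is fixed, no new parameters are introduced for adaptation; every adaptation frame contributes to one of the small number of monophone classes, so far fewer output classes go unseen than with CD-state targets.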