Convolutional density estimation in hidden Markov models for speech recognition

Abstract
In continuous density hidden Markov models (HMMs) for speech recognition, the probability density function (PDF) for each state is usually expressed as a mixture of Gaussians. We present a model in which the PDF is expressed as the convolution of two densities. We focus on the special case where one of the convolved densities is an M-Gaussian mixture and the other is a mixture of N impulses. We present the reestimation formulae for the parameters of the M×N convolutional model, and suggest two ways of initializing them: the residual K-means approach, and deconvolution from a standard HMM with MN Gaussians per state using a genetic algorithm to search for the optimal assignment of Gaussians. Both methods result in a compact representation that requires only O(M+N) storage space for the model parameters, and O(MN) time for training and decoding. We explain how the decoding time can be reduced to O(M+kN), where k &lt; M. Finally, results are shown on the 1996 Hub-4 Development test, demonstrating that a 32×2 convolutional model can achieve performance comparable to that of a standard 64-Gaussian per state model.
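The key property behind the compact representation is that convolving a Gaussian with an impulse simply shifts its mean. The following is a minimal sketch (not the paper's implementation) of evaluating such an M×N convolutional density in the scalar case; all function and variable names are illustrative.

```python
import numpy as np

def conv_density(x, w, mu, var, v, delta):
    """Density at scalar x for the convolution of an M-Gaussian mixture
    (weights w, means mu, variances var) with a mixture of N impulses
    (weights v, offsets delta). The result is an MN-Gaussian mixture
    whose components share the M Gaussians' variances and combine the
    M means with the N offsets -- hence O(M+N) parameters, O(MN) evaluation."""
    total = 0.0
    for wi, mi, vi in zip(w, mu, var):
        for vj, dj in zip(v, delta):
            m = mi + dj  # impulse at offset dj shifts the Gaussian mean
            total += wi * vj * np.exp(-0.5 * (x - m) ** 2 / vi) / np.sqrt(2 * np.pi * vi)
    return total

# With a single impulse at offset 0, the model reduces to the plain
# M-Gaussian mixture.
p = conv_density(0.5, [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [1.0], [0.0])
```

In a full HMM this evaluation would be vectorized over feature dimensions and frames, but the mean-shift structure shown here is what makes the parameter sharing possible.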