Fast speaker adaptation combined with soft vector quantization in an HMM speech recognition system

Abstract
The authors describe a method for combining speaker adaptation by feature vector transformation with semi-continuous hidden Markov modeling (SCHMM). Since the reference speaker's voice is represented in the SCHMM system by multidimensional Gaussian distributions, it is these distributions rather than feature vectors that must be transformed. The performance of hard-decision vector quantization (HVQ), soft-decision VQ (SVQ), and SCHMM are compared as are the speaker-adaptive and speaker-independent systems. In addition, the influence of dynamic features is investigated. The definition of subword units is optimized, and, with respect to full or diagonal covariance matrices and codebook size, the SCHMM system is optimized. Model initialization and distribution reestimation during training is introduced. Significant improvements are obtained compared to previously reported systems based on HVQ: from 71.6% to 84.6% (speaker-independent) and from 80.4% to 87.4% (speaker-adaptive) mean recognition rate under difficult conditions.

This publication has 4 references indexed in Scilit: