Audio-visual speech modeling for continuous speech recognition
- 1 September 2000
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Multimedia
- Vol. 2 (3) , 141-151
- https://doi.org/10.1109/6046.865479
Abstract
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration, and hence enables more accurate modeling of temporal dependencies than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate them into the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
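As context for the fusion step described in the abstract, multistream HMMs typically combine per-stream state likelihoods through exponent weights, b_j(o_t) = prod_s b_js(o_ts)^lambda_s. The sketch below illustrates this combination in log space for an audio and a visual stream; it is a minimal sketch under assumed names and weight values, not the paper's trained parameters, and it does not model the learned inter-modality asynchrony that the paper contributes.

```python
import numpy as np

# Minimal sketch of stream-weighted likelihood combination in a
# multistream HMM state (audio + visual). The weights below are
# illustrative assumptions, not values from the paper.

def multistream_log_likelihood(log_b_audio, log_b_visual,
                               w_audio=0.7, w_visual=0.3):
    """Joint state score: sum over streams of lambda_s * log b_js(o_ts)."""
    return (w_audio * np.asarray(log_b_audio)
            + w_visual * np.asarray(log_b_visual))

# Hypothetical per-frame log-likelihoods from audio and visual state models.
log_b_a = [-12.3, -10.1, -11.7]   # audio stream, 3 frames
log_b_v = [-8.4, -9.0, -8.8]      # visual stream, 3 frames
print(multistream_log_likelihood(log_b_a, log_b_v))  # [-11.13 -9.77 -10.83]
```

In practice such exponents are usually constrained (for example, to sum to one) and tuned to the acoustic SNR, which is how a combined system can lean more on the visual stream as acoustic noise increases.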